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To our ancestors and parents, as always 


Preface 


Improving collaborative robot safety by data-driven monitoring and protection is 
critical for robots to be actively useful in human daily life. While some success has 
been demonstrated in structured and static environments that robots are slow, 
clumsy, and not general or robust enough when performing dexterous manipula- 
tions. As humans, doubt and understanding our own limitations, failures, and 
shortcomings is a key for improvement and development. Such knowledge alters 
our behaviors e.g. to execute tasks in a more cautious way. The main reason is 
humans have years of experience to collect data and develop good internal models 
of what happens when interacting with their environment. Equipping robots with a 
set of skills that allows them to assess the quality of their sensory data, internal 
models, used methods, etc. is correspondingly believed to greatly improve the 
overall performance of an autonomous system. 

Statistical models have shown powerful performance for signal processing in 
recent years. Particularly, the nonparametric Bayesian learning methods are widely 
applied to the study of underlying dynamical phenomena generated in fields as 
diverse as robotics, bioinformatics, econometrics, and autonomous systems and 
control. Within these fields, there has been an explosion of multimodal data of 
increasingly complex phenomena, resulting in a push toward building more intri- 
cate multivariate time series models for accessing the quality of those sensory data. 
In recent robotics, human-robot coexistence and cooperation have become an 
urgent demand in modern manufacturing and services industries, resulting in the 
concept of “Human-Robot Collaboration, HRC” and “Physical Human-Robot 
Interaction, pHRI’. Sharing intelligence, having a common behavior, and cooper- 
ating with people to accomplish common tasks (behavior, task, and intelligence) are 
their basic characteristics and elements, and also they have become the consensus of 
domestic and foreign academic circles. The prerequisite of HC is human-robot 
peaceful coexistence, that is, “safety collaboration”. Therefore, the aim of this book 
is to introduce the Nonparametric Bayesian learning methods for robot introspec- 
tion that ensures the robot with longer-term autonomy and safer HRC environment. 
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In this book, the robot introspection relates to safety that has a direct impact on a 
variety of research areas, such as HRC, pHRI, long term autonomy, and many 
others. Long term autonomy can benefit from autonomous anomaly monitoring and 
diagnosis as well as the anomaly recovery strategies. The ability to reason and solve 
its own anomalies, and proactively enrich owned knowledge is a direct way to 
improve autonomous behaviors of a robot. To this end, we start by considering the 
underlying pattern of multimodal observation during robot manipulation, which can 
be well modeled as a parametric Hidden Markov Model (HMM). And then, we 
instead take a nonparametric Bayesian approach in defining a prior through the use 
of the Hierarchical Dirichlet Process (HDP) on the standard HMM parameters, 
named Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM). The 
HDP-HMM can examine an HMM with an unbounded number of possible states 
and allow for flexibility in the complexity of the learned model and for the 
development of reliable and scalable variational inference methods. 

Therefore, although many stable and robust robot control algorithms exist, 
internal modeling errors or external disturbances still affect the robotic systems, 
such as human collision, object sliding, and tool collision. The pHRI scenarios are 
the unstructured and non-standardized dynamic environments that cannot be 
completely modeled and analyzed. To endow robots with longer-term autonomy 
and safer HRC environment, it is necessary to model the real-time multi-modal 
signals for implementing the accurate behavioral introspection and anomaly 
recovery policy learning. In this book, the robot multimodal introspection and 
learning are theoretically investigated, and the main contents and achievements are 
as follows: 

Chapter 1—In this chapter, we concisely introduce the definition of robot intro- 
spection that is considered in this book, and explain those interesting topics to be 
addressed. Moreover, the detailed outline of this book is presented. To make the 
following contents clear and easy to follow, we take all the research contents and 
achievements of this book into consideration; a multi-modal fusion-based robot 
introspective and learning system framework named SPAIR (Sense-Plan- 
Act-Introspect-Recover) is proposed based on the traditional robot control frame- 
work Sense-Plan-Act (SPA), which adds the phases of Robot Introspection (behavior 
recognition, anomaly monitoring, anomaly diagnosis) and anomaly recovery. The 
framework intentionally endows the robot with longer-term autonomy and safer 
human-robot collaborative environment, which mainly includes four functional 
modules: (1) directed graph representation of complex robot manipulation tasks; 
(2) robot movement generation and generalization learning; (3) real-time anomaly 
monitoring and diagnosis during robot execution; (4) robot anomaly recovery policy 
learning. 

Chapter 2—In this chapter, to address the problem of multi-modal fusion, we 
abstract its research object as to how to effectively establish and interpret the 
probabilistic model of multivariate time series. The uncertainty of the number of 
hidden states and the high transition frequency between hidden states are two 
critical problems of HMM in modeling multivariate time series, which will greatly 
weaken the modeling performance and time consistency. Based on this, several 
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nonparametric Bayesian models named Sticky Hierarchical Dirichlet Process 
Hidden Markov Models (SHDP-HMM) are investigated to jointly model the robot 
end-effector’s velocity, force/torque, tactile signals of as well as their related sta- 
tistical information (such as mean and variance) during robot manipulation task. 

Chapter 3—In this chapter, to address the problem of learning and generalizing 
complex tasks of robots multimodal fusion and system framework construction, a 
method of combining Dynamic Movement Primitive (DMP) and Finite State 
Machine (FSM) is proposed in this paper, which divides the robot complex 
manipulation task into sequential motion primitives for improving the adaptability 
and diversity of the manipulation tasks, i.e. parameterized directed graph descrip- 
tions. The proposed method is based on the basic theory of robot learning from 
demonstration. 

Chapter 4—In this chapter, nonparametric Bayesian models-based methods with 
the multi-modal fusion for robot real-time motion behavior recognition and 
anomaly monitoring are proposed. First, the normal multi-modal signals of each 
movement primitives are modeled using sHDP-HMM with the help of a parametric 
description of the robot's manipulation task. Then, the robot behavioral recognition 
is implemented by comparing the cumulative log-likelihood values of real-time 
observations. Finally, three types of abnormal thresholds for robot abnormal 
monitoring are proposed when the behavior is known, which include log-likelihood 
value, log-likelihood gradient value as well as mapping relationship between the 
latent state and log-likelihood value. 

Chapter 5—In this chapter, nonparametric Bayesian models-based 
multi-objective classifiers for multi-modal anomaly diagnosis are also investi- 
gated when the multi-modal anomaly is triggered. Additionally, anomalous samples 
are extracted before and after the occurrence of anomaly events according to the 
size of the given window; sHDP-HMM model is learned for each anomaly type 
individually; the optimal model is selected by cross-validation method; anomaly 
diagnosis is tackled by comparing the cumulative log-likelihood values of testing 
samples across all learned models. Besides this, an anomaly diagnosis method by 
the multimodal sparse representation method of anomaly samples is presented. 

Chapter 6—In this chapter, we mainly focus on learning the recovery policy 
based on anomaly monitoring and diagnosis. Typically, two task-level robot 
anomaly recovery policies are, respectively, proposed by learning the experience 
and intention of human treating accidental and persistent anomaly events, which 
include the re-enactment policy for accidental anomalies by using a multinomial 
distribution as well as the adaptation policy for persistent anomalies by the para- 
metric representation of human’s demonstrations. 

In summary, this book mainly presents methods and algorithms for the collab- 
orative robot multimodal introspection via nonparametric Bayesian methods, 
together with the corresponding theoretical analysis and experimental verification 
on two type of robots that perform three kinds of manipulation tasks, including 
HIRO-NX robot performing electronic assembly, Baxter robot performing 
pick-and-place task, and Baxter robot performing kitting experiment. This book is 
written for graduate students, academic, and industrial researchers in the field of 
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nonparametric Bayesian methods, robotics, physical human-robot interaction, 
multivariate time series modeling, and autonomous system design. This book 
covers the clear explanation of the importance of robot introspection in the tread the 
robotic techniques, included but not limited to the task representation of robot 
complex task, the multimodal time series modeling, the anomaly monitoring, and 
diagnosis as well as the anomaly recovery policy during robot manipulation task. It 
provides not only rigorous theoretical analysis on the nonparametric Bayesian 
based methods and algorithms for assessing the quality of multimodal sensory data 
but also many intuitive real-robot experiments with or without cooperating with a 
human. We do hope that this book will benefit the readers, in terms of knowing the 
basic ideas in the nonparametric Bayesian modeling of multivariate time series for 
robot introspection. We also hope that the presented new advancements in the field 
of anomaly monitoring, diagnosis, and recovery could explore new theoretical tools 
and practical applications. 

At the end of this preface, it is worth pointing out that, in this book, some 
distributed methods for robot introspection and their applications are provided. The 
ideas in this book may trigger more studies and researches in next generation of 
robotics, especially nonparametric Bayesian methods based multivariate time 
series modeling. There is no doubt that this book can be extended. Any 
comments or suggestions are welcome, and the authors can be contacted via 
e-mail: hm.wu @giim.ac.cn (Hongmin Wu). 
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Chapter 1 A) 
Introduction to Robot Introspection get 


Abstract In this chapter, we mainly introduce the definition, background, signifi- 
cance, and the start-of-the-art methods of collaborative robot multimodal introspec- 
tion. The current issues of robot introspection are also introduced, which including 
the complex task representation, anomaly monitoring, diagnoses and recovery by 
assessing the quality of multimodal sensory data during robot manipulation. The 
overall content of this book is presented at the end. 


1.1 What is Robot Introspection? 


As humans, doubt and understanding our own limitations, failures and shortcomings 
is a key for improvement and development, such knowledge alters our behaviors e.g. 
to execute tasks in a more cautious way [1, 2]. Due to the humans can learn and 
recognize the environment for a long time through the five senses (visual, acousti- 
cal, gustatory, osphretic, and tactile) and generate corresponding internal models in 
the daily execution of operational tasks [3-7]. Correspondingly, equipping robots 
with a set of skills that allows them to assess the quality of their sensory data, 
internal models, used methods etc. is correspondingly believed to greatly improve 
the autonomous operability and safety performance of an autonomous system, e.g. 
presented in literature [8, 9]. 

In recent years, the ability of robots to learn intrinsic introspective models like 
humans has been favored by robotics researchers [10, 11]. With respect to humans, 
introspection is the process of examining one’s internal state by evaluating the poten- 
tial patterns of multi-modal sensing data [12, 13, 15], as shown in Fig. 1.1. Its 
application lies in giving robots three capabilities: “What’s it doing?”, “How to 
do?”, and “How’s it doing?”. Among them, “What’s doing?” is to solve the prob- 
lem of continuous state estimation during the robot execution; “How’s it doing?” is 
to monitor the real-time state during robot execution, such as normal or abnormal; 
“How to do?” including the robot task planning and decision making on future 
motion. The development and application of robot introspection can effectively 
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(a) The robot performs the process of assembling electronic components 
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Fig. 1.1 Illustration of assessing the underlying dynamics of the multimodal signals that generated 
by a robot performing the electronic assembly at real-time 


(1) estimate, monitor and prevent abnormal events; (2) evaluate the internal state 
of the robot; (3) strengthen the ability to recovery abnormalities; (4) optimize con- 
trol and motion decisions. 

However, robots do not have thoughts neither feelings, only data, hardware and 
algorithms. Therefore, robots can only assess the quality of sensor data, internal mod- 
els, representations, information, perception input etc. Such knowledge can later lead 
to a modification of robots behavior by including the assessed quality score in the 
planning process. Consequently, robot introspection relates to safety, active percep- 
tion, mapping and many topics. These topics have a direct impact on a variety of 
research areas, such as long term autonomy, search and rescue, and many others. 
Long term autonomy can benefit from autonomous failure recovery and active learn- 
ing. For search and rescue, estimation of the confidence of the sensor input and used 
maps is essential for overall risk assessment. Moreover, for a large variety of tasks 
assessing the quality of sensor data, internal models, representation, information 
will directly affect mission success. The ability to reason and solve its own failures, 
and proactively enrich owned knowledge is a direct way to improve autonomous 
behaviours of a robot. 


1.2 Background and Significance 


Traditional industrial robots have played a significant role in standardization and 
automated production by isolating from human, which is difficult to meet unstruc- 
tured, personalized, and flexible non-standard complex task requirements in a 
dynamic environment such as the human robot interactive scenarios that shown in 
Fig. 1.2. Due to they lack of multi-functional and intelligent capabilities that per- 
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(a) Robot in logistics (b) Robot in electronic assembly (c) Human-robot interaction 


Fig. 1.2 Robots are applied in diverse, individualized and flexible non-standard tasks in unstruc- 
tured dynamic environments 


formed and without sufficient external sensing information. In addition, with the 
widespread application and development of collaborative robots [16, 17], robots 
will gradually move from a traditional closed manufacturing environment to a shared 
space that coexists and interacts with human [18], from semi-automatic operation 
tasks to autonomously perform, which inevitably leads to various unpredictable 
anomalies, such as objects slipping, collisions between end-effector and the envi- 
ronment, human collisions, and system abnormalities. Therefore, in order to give 
the robot a longer-term autonomy and a safer human-machine collaborative environ- 
ment, the robot should perform real-time multi-modal fusion modeling to achieve 
accurate introspection of its own movement behavior (identification of movement 
behaviors, abnormal monitoring, and abnormal diagnoses), and anomaly recovery 
are essential in the next generation of intelligent collaborative robots. 

Generally speaking, the robot follows the control framework of the Sense-Plan- 
Act (SPA) in a structured environment. First, the robot observes the surrounding 
environment and builds an internal model [14]. Then, it formulates a task execu- 
tion plan, and finally executes this plan. However, the control framework is difficult 
to meet the increasingly complex robot operation tasks in unstructured dynamic 
environments [20, 23]. In the recent decades, increasing the robot multi-modal 
introspection and anomaly recovery policy after the robot execution would be an 
effective way to address the aforementioned issues. To this end, a novel Sense-Plan- 
Act-Introspect-Recover (SPAIR) control framework is proposed in this book, which 
can provide a theoretical framework and solutions for autonomous robot operation, 
human-computer interaction, and human-machine integration in the future. 


1.3 Issues to be Addressed in Robot Introspection 


This book intends to provide the reader with a comprehensive overview of the newly 
developed methods for improving robot introspection. In recent years, with the rapid 
development of collaborative robots, scholars in the field of robotics and artificial 
intelligence have carried out explorations on learning from human demonstration, 
integration with humans, multimodal sensing, and learning methods and theories 
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of robots. Although lots of research results and research ideas within those topics 
are presented, the difficulties and key issues of multi-modal modeling and its appli- 
cation of robots are not fully considered, and effective theoretical frameworks and 
systematic integration have not yet been formed. The application research of robot 
multi-modal sensing makes the research in this book more valuable. Therefore, this 
book develops a theoretical study of robot multi-modal introspection and learning 
based on non-parametric Bayesian models, aiming to endow robots a longer-term 
autonomy and a safer human-robot collaboration environment. With respect to the 
state-of-the-art status of robot introspection, the following five issues need to be 
addressed urgently: 


e (1) How to assess the quality of internal models, methods and sensor data, 
used by robots and how to alter their behavior upon this information? 


Humans can accurately perceive the state of the outside world or the objects they 
need to operate through the five senses based on past experience and internal models, 
and can self-recognize the advantages and disadvantages to learn and adjust existing 
experiences and models. However, robots can only sensing the world through sensory 
data such that how to model and analyze of multi-modal time series is a difficult 
problem to realize robot introspection. 


e (2) How to represent the complex robot task and generalize to the environ- 
mental changes during unstructured scenarios? 


It’s difficult to complete tasks from the beginning to the end with fixed and pre- 
programmed movements in the case of an unstructured dynamic human-robot col- 
laboration environment. Resulting in encountering fault or abnormality of the task 
because of the unstructured environment, the changing pose of the operating object, 
and the unpredictable robot state. Therefore, how to learn and generalize the robot 
complex tasks from human demonstration for adapting to the changes in the envi- 
ronment is an valuable question. 


e (3) How to implement the anomaly monitoring and diagnoses of multimodal 
time series during robot execution? 


The robot anomalies usually derived from joint encoders, the environmental changes, 
and human interference in the human robot interaction scenarios. Assuming that the 
robot can monitor and classify these anomalies, it will reduce or prevent the robot 
from being potentially injured and improve the safety of human-robot collaboration. 
Due to the diversity and complexity of the types of anomalies, it isn’t possible to 
directly enumerate and model all anomalies, resulting the anomaly monitoring cannot 
be achieved through supervised learning methods. Generally, the robot anomaly 
monitoring were implemented by learning the outstanding difference between the 
normal ones. In other words, anomalies are identified from models of normal robot 
behavior. Making the robot anomaly monitoring and diagnoses should be critical 
factor for realizing the robot longer-term autonomy. 
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e (4) How to learn the recovery policy from human demonstration for specific 
anomalous event? 


Robot anomaly recovery cannot be performed by the traditional robot’s own motion 
planning algorithm under the human robot interaction scenarios. Whereas human 
expectations for robot motion should be taken into consideration. The main challenge 
is how to incrementally learn the robot recovery policy when encounter various 
abnormal events from unstructured human demonstrations. 


e (5) How to develop a stable and extendable software framework for integrat- 
ing the introspective abilities by modeling the multimodal sensory data? 


How to develop a software for robot multimodal introspection that including multi- 
functional modules for complex task representation, multi-modal fusion, anomaly 
monitoring, anomaly diagnoses, and anomaly recovery. It’s another common issue 
to be solved for realizing long-term autonomy and safe human robot collaborative 
environment. 


1.4 System Framework for Robot Multimodal 
Introspection 


1.4.1 Introduction 


With the rapid development of collaborative robots, robots are moving from the tra- 
ditional structured environment of industrial manufacturing to unstructured dynamic 
shared workspaces that coexist with, collaborate with, and integrate with people. 
Although the robot has a more robust and stable control algorithm, the algorithm can- 
not precisely model the unstructured environment, and unexpected events will occur. 
In order to endow the robot with a longer-term autonomy and a safer human-robot 
collaborative environment, the robot must implement self-monitoring and recovery 
capabilities by evaluating multi-modal sensory sensor in real time. 

This book aim to propose a theoretical framework and extendable system platform 
for robot multimodal introspection and learning based on non-parametric Bayesian 
models. It mainly includes the following four aspects: 


Modeling of multimodal time series, details in Chap. 2; 

Learning and representation of robot complex task, details in Chap. 3; 
Multimodal anomaly monitoring, details in Chap. 4; 

Multimodal anomaly diagnose, details in Chap. 5; 

Heuristic incremental learning robot recovery policies, details in Chap. 6. 


The proposed framework divides the complex tasks of the robot into the directed 
graph, and learns the motion primitive model and realizes the recognition of the 
motion behavior for the motion behavior between the nodes in the directed graph; The 
non-parametric Bayesian time series model learns models for anomaly monitoring 
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and anomaly diagnoses models, and compares and analyzes the performance and 
results of anomaly monitoring under three different anomaly threshold conditions. 
A robot anomaly recovery policy is designed at the top of the framework to handle 
anomalies encountered by external disturbances based on the representation of robot 
complex task and introspection. We assume the considered recovery policy should be 
learned incrementally from the context of task. Therefore, the proposed Re-enactment 
and Adaptation recovery policies correspond to accidental anomalies and persistent 
anomalies, respectively. Among them, the reenact policy uses a probability model of 
a polynomial distribution to count the humans decision-making at different anomaly 
situations for the accidental anomalies. Additionally, the robot learns and stores 
its recovery policy, while the motion adaptation uses human intuition to teach the 
recovery of persistent anomalies. 

The anomalous event of robot is unpredictable, so that the recovery policy of 
the robot under different abnormal conditions cannot be planned offline. Therefore, 
learning the recovery policy incrementally is the key to achieving human-robot safety 
collaboration. The framework has fast and robust anomaly monitoring and diagnoses 
capabilities for both initially planned motion primitives and newly learned recovery 
motion primitives. In addition, by using non-parametric Bayesian models to incre- 
mentally learn models from multi-modal sensing data, the stability, reliability and 
fault tolerance of the framework are well improved. 


1.4.2 Modules of Introspective System 


To endow robots a longer-term autonomy and a safer human-robot collaboration 
environment, the robot’s Introspection phase and the Recovery phase were added to 
the traditional robot control framework SPA. The system framework of the multi- 
modal sensing and learning is shown in Fig. 1.3 and named SPAIR. The framework 
mainly includes four modules: (1) directed graph representation of complex robot 
tasks; (2) generalization and identification of robot motion behavior; (3) real-time 
abnormal monitoring and diagnoses during robot execution; (4) learning recovery 
policy for recovering from robot abnormal events. 

In recent years, the directed graph representation of complex tasks of robots has 
been investigated by a large number of scholars, such as Kappler et al. [19], Fox 
et al. [12], Kroemer et al. [21], and Niekum et al. [22]. To address the problem of 
robot task representation, many scholars tried to represent the complex task by a 
series of movement primitives, and realized the contact state estimation and task 
representation of the robot [13, 24-26]. This book assumes that the robot’s tasks are 
represented by a directed manipulation graph that include a set of movement primi- 
tives, and each primitive is composed of nodes: the start node and the target node, and 
along with the transition between different movement primitives is expressed too. In 
order to accurately represent the robot complex tasks and the subsequent recovery 
behavior, the movement primitive is formulated as follows: (1) The behavior of the 
robot is used to represent a complete movement of the robot, as shown in Fig. 1.3. 
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Fig. 1.3 A systematic framework for robot introspection and learning 


The complete process from the “Home” node to the “Pre-pick” node; (2) The robot 
task is usually composed of one or more movement primitives, as shown in Fig. 1.3. 
The “Pick-and-Place” task of the robot consists of four motions The behavior is 
composed of five action nodes, among which the movement primitives are “Home 
— Pre-pick”, “Pre-pick — Pick”, “Pick — Pre-place” and “Pre-place —> Place”. 
In particular, the directed manipulation graph will be updated when the robot 
encounter an abnormal event, which including: ⁄ j indicates a recovery node 
between nodes .% and .V; and node .%;, denotes a recovery node between nodes 
Mj and M. By analogy, nodes for recovery will be generated in the original graph 
structure, and the connection between the nodes (including the recovery node and 
the node of the original task) will generate new motion behavior. From the recovery 
module at the top of Fig. 1.3, it can be known that the manipulation graph ¥ of the 
robot tasks will gradually robust and stable with the increasing recovery behaviors. 
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The nodes in the SPAIR framework will include multiple functions performed in 
parallel, including: the robot movement properties .@, the robot introspective prop- 
erties (J), and the robot sensing external environment properties ”. The -@ in this 
book refer to the robot movement primitives; (J) includes the robot’s movement 
identification, anomaly monitoring, and anomaly diagnoses; (%) represent the visual 
estimation of the object’s pose in the environment. Among them, the robot’s intro- 
spective attributes are mainly responsible for real-time monitoring of robot anoma- 
lies. Once anomalies are detected, the system will classify the anomalies, and the 
system will recovery according to a limited recovery strategy for different types 
of anomalies. If the recovery strategy does not exist, The system will prompt the 
need for human recovery demonstration, and learn and update its recovery behavior. 
According to the diagnoses result, if it is a accidental anomaly, the robot will perform 
the motion redo recovery behavior, as shown in “anomaly — i” and “anomaly — j” 
shown in Fig. 1.3; if it is a persistent anomaly, the robot will execute the motion 
adaptation recovery behavior, as shown in “anomaly — k” and “anomaly — m” in 
Fig. 1.3. 

The proposed framework is built on the ROS-Jndigo! platform, allowing signals 
of different modalities to communicate in the form of topics, and the sampling fre- 
quency of different sensors is fixed to a specific value by synchronous sampling 
(this book uses 10 Hz), and combines multiple open source software and code, such 
as the description of the robotic task by the finite state machine SMACH and the 
ar_track_alvar? for pose estimation of the object. In addition, the system will auto- 
matically collect multi-modal sensor signal data for each motion primitive and the 
subsequent proposed recovery behaviors during the robot’s movement. So that the 
robot can learn experience and generalize movement from the generated data. To this 
end, the relevant code, videos, pictures and usage documents of this book will be all 
open source,* so that others researchers can improve and share it. 


1.5 Summary 


In this chapter, we mainly introduce the definition and significance of robot introspec- 
tion in the background of human-robot collaboration. We inspired from human that 
doubt and understanding our own limitations, failures and shortcomings is a key for 
improvement and development, such knowledge alters our behaviors e.g.to execute 
tasks in a more cautious way. To endow robots a longer-term autonomy and a safer 
human-robot collaboration environment, we referred to the state-of-the-art status of 
robot introspection, five issues are needed to be addressed urgently. To this end, we 
proposed an introspective system framework with four modules, named Sense-Plan- 


"http://wiki.ros.org/indigo. 

*http://wiki.ros.org/smach. 

3 http://wiki.ros.org/ar_track_alvar. 
“https://github.com/birlrobotics/smach_based_introspection_framework. 
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Act-Introspect-Recover (SPAIR) for address the problems including robot complex 
task graphical representation and execution (skill) identification from unstructured 
demonstrations, multimodel time series modelling for robot anomaly monitoring and 
diagnose as well as the recovery policies learning for different anomalies. 
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Chapter 2 R) 
Nonparametric Bayesian Modeling get 
of Multimodal Time Series 


Abstract In this chapter, we take a Bayesian nonparametric approach in defining 
a prior on the hidden Markov model that allows for flexibility in addressing the 
problem of modeling the complex dynamics during robot manipulation task. At 
first, considering the underlying dynamics that can be well-modeled as a hidden 
discrete Markov process, but in which there is uncertainty about the cardinality of 
the state space. Through the use of the hierarchical Dirichlet process (HDP), one can 
examine an HMM with an unbounded number of possible states. Subsequently, the 
sticky HDP-HMM is investigated for allowing more robust learning of the complex 
dynamics through a learned bias by increasing the probability of self-transitions. 
Additionally, although the HDP-HMM and its sticky extension are very flexible 
time series models, they make a strong Markovian assumption that observations are 
conditionally independent given the discrete HMM state. This assumption is often 
insufficient for capturing the temporal dependencies of the observations in real data. 
To address this issue, we consider extensions of the sticky HDP-HMM for learning the 
switching dynamical processes with switching linear dynamical system. In the later 
chapters of this book, we will verify the performances in modeling mulitmodal time 
series and present the results of robot movement identification, anomaly monitoring, 
and anomaly diagnose. 


2.1 Introduction 


Human can perceive the state of the outside world or the objects that need to be 
operated through the five perspectives based on past experience and internal models, 
and can recognize the advantages and disadvantages of self to learn and adjust exist- 
ing experiences and models. However, robots can only understand the outside world 
through sensing data, so modeling and analysis of multi-modal time series is a diffi- 
cult and hot research topic to realize robot perception. The multi-modal information 
during the execution of the robot is a highly complex and dynamic process that incor- 
porates non-linearities such as motion encoders, vision, forces/torques, haptics, and 
sounds. Modeling based on the traditional hidden Markov model has the problems of 
uncertainty of the number of recessive states and rapid conversion of recessive states, 
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and it is difficult to accurately capture the potential modes of multi-modal data. In 
this chapter, based on the Bayesian basic theory and the theoretical derivation of the 
HMM model, combined with the inherent characteristics of the robot’s multi-modal 
sensor information, it proposes that it is more efficient and complicated in terms of 
complex dynamic phenomena, uncertainty, and model parameter learning. The robust 
non-parameterized Bayesian model implements a solution in which the number of 
recessive states of the HMM model is determined by the complexity of the data and a 
high self-transition expected probability of the recessive state. To a certain extent, the 
non-parameterized Bayesian model is Research on robot multi-modal information 
sensing and fusion provides theoretical framework and application guidance. 


2.2 Related Works 


In neurosciences, studies have shown that the different senses of humans can be 
closely combined [1], so that they can implement the complex manipulation tasks 
and environmental perception in daily life. In recent years, inspired by the results of 
this research in neurosciences, scholars have carried out research on the role of mul- 
timodal sensing information fusion for implementing practical manipulation tasks 
in robotics [2, 3]. Fitzpatrick et al. [4] introduced a method of cross-modal verifica- 
tion, which enhances the robot’s perception of multi-modal events by repeating and 
redundancy of sensing information, so that the robot can learn the underlying corre- 
lation between vision and auditory. Wu et al. [5] studied the use of a combination of 
acceleration and sound measurements to detect structural defects in aircraft parts. Su 
et al. [6] introduced a multi-modal event detection framework based on the robot’s 
axis hole assembly task, including the robot’s end pose and haptic sensing signals. 
It can be known from aforementioned instances that multimodal fusion refers to the 
synthesis of sensory data with a certain temporal sequence and spatial relationship 
from multiple different sensors, modeling at different levels and learning its potential 
patterns [7]. Then, the essence of multimodal fusion is to realize the modeling and 
analysis of Multivariate Time Series (MTS). For example, in order to allow the robot 
can perform flexible operations in an uncertain environment, a force/torque sensor is 
mounted on the end-effector [8], a tactile sensor and a acoustical sensor are installed 
on the robot claw, and a visual sensor is placed for monitoring the robot workspace 
[9-11], as shown in Fig.2.1. In this way, the multi-sensor fusion method can be 
used to more comprehensively and accurately monitor environmental characteris- 
tics, eliminate signal variability and information uncertainty, and improve system 
reliability and stability [7]. 

Time series modeling is pervasive in fields as diverse as speech recognition, 
robotics, economics, medicine, multimedia, bioinformatics, and system control [12]. 
In recent years, related modeling methods have emerged endlessly, among which the 
Hidden Markov Model (HMM) [13] is the most popular method that were used in 
speech recognition [14-16], computational biology [17, 18], Machine translation 
[19, 20], cryptanalysis [21, 22], economics [23, 24], and human behavior recogni- 
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Fig. 2.1 Robotic manipulating platforms integrated with multiple sensors 


tion [25, 26], etc. In particular, HMM has been widely used in the field of robotics, 
mainly including process supervision [27, 28], robot state estimation [29, 30], deci- 
sion making [31, 32], and learning robot motion primitive transformation [33, 34], 
etc. In literature [27] indicated that HMM can be used for anomaly monitoring, pro- 
viding a solution for the follow-up work of this book in anomaly implementation. 
Although there have been a lot of successful application cases of HMM, this model 
has two inherent shortcomings [35]: (1) It’s difficult to determine the size of the HMM 
hidden state space (the number of hidden states); (2) HMM follows the Markov chain 
hypothesis (in a given hidden state, the observation data are distributed independently 
of each other) and the traditional Expectation Maximization (EM) algorithm is used 
to infer unknown parameters [36, 37]. This inherent disadvantage greatly weakens 
the accuracy and efficiency of HMM modeling in time series data. 

To address the problems of HMM, non-parameterized Bayesian model [38] and 
HMM of infinite hidden state space [39] were proposed. Yee Whye The et al. [40] 
proposed a Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM). 
HDP-HMM uses the hidden state transition probability distribution and observation 
distribution of HMM. The methods of introducing hyper-parameters to Bayesian a 
priori are added separately, so that the number of hidden states can be learned from the 
complexity of the training data, and is calculated by the Markov Chain Monte Carlo 
(MCMC) method. The Gibbs Sampling (GS) method infers the unknown parameters 
of the model [41, 42]. HDP-HMM uses the full Bayesian method to learn the number 
of hidden states by assuming an infinite state space, it effectively avoids overfitting 
the data and repeated trial and error model selection. From the process of the model 
implementation and experimental results, it can be known that HDP-HMM’s mod- 
eling of hidden states still retains the properties of the original Markov chain, which 
will cause rapid transitions between hidden states, resulting in a lack of the principle 
of time continuity. The so-called time continuity means that the occurrence of events 
can always maintain a certain consistency, which makes it difficult to analyze events 
in the real world. To address the problems of HDP-HMM, Fox et al. [43-45] proposed 
a Sticky Hierarchical Dirichlet Process Hidden Markov Model (SHDP-HMM). The 
sHDP-HMM tackle the rapid transition of hidden states in HMM is to increase the 
self-transition of hidden states by adding the bias of the probability of each hidden 
state transition, so as to meet the requirement of time continuity. The experimental 
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results show that the “sticky” hidden state enhances the consistency modeling of 
time series data, and can achieve more accurate results than HDP-HMM in speech 
recognition and multiple speaker recognition experiments [43]. 


2.3 Theoretical Basis of Bayesian Model 


2.3.1 Multimodal Time Series 


With the development of multi-modal sensing fusion technology and the diversity 
of the environment, except the joint encoders, robots often need to add force/torque 
sensors and tactile sensors to sense the surrounding environment, so that the robot’s 
observation is often multimodal. In general, a multidimensional time series with 
dimension D and length T is represented as follows: 


Yrxp = [y1 6), y2), .-- , Yat), ». YOO), t = 1,2,...,T; 1 <d < D (21) 


where, t represents a time frame, d represents a variable, and y4 (t) denotes the 
observation of the d-dimensional variable at the t-th time step. For ease of writing, 
define the observation vectors for all dimensions at time f as y; € R?. Hence, the 
Eq. 2.1 is expressed as a matrix with T x D, and the matrix is set as an example of 
a multidimensional time series, that is, 


yı yA) y2) . yall) . yol) 
y2 yi(2) y2(2) +- ya(2) -> yo(2) 

ERED he |= : ; : ; (2.2) 
Yr yi(T) yo(T) +++ ya(T) +++ yo(T) 


Each row in Eq. 2.2 represents the observation at time ¢, and each column represents 
the data of a signal. This book uses a non-parametric Bayesian hidden Markov model 
to learn and analyze the multi-dimensional time series [45, 46]. 


2.3.2 Bayes’ Rules 


Assuming two random variables are y and 0 respectively, then according to the 
definition of Bayes’ rule as follows: 


0) p@ 
ply) = == a O (2.3) 
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Where p(0|y) is the posterior probability, p(y|@) is the likelihood probability, p(@) is 
the prior probability, and p(y) is the standardized constant. From Eq. 2.3, the posterior 
probability is proportional to the product of the likelihood probability and the prior 
probability. Bayesian law summarizes the process of Bayesian inference: given the 
likelihood probability and the standardized constant, the posterior probability can 
be inferred from the prior probability. The Bayesian time series model mentioned 
in this book means that the variables in the model are time-dependent, but due to 
the endless time correlation, the model cannot be established, so it is proposed that 
the variables require certain constraints are time-independent such as Markov chain 
constraints. So that two adjacent variables y,,,; and y; can be predicted by Eq. 2.3 


soi Pily) POD (2.4) 
P(r) 


Therefore, if y in Eq. 2.3 is expressed as the random variable and 6 as parameters 
of corresponding probability model, where unknown parameter is accompanied by 
a hyper-parameter, Bayesian theory can be extended to 


6) p(6|A 
pees pyle) pA) (2.5) 
pO) 


From Eq. 2.5, we can know that the parameters of the model are estimated by maxi- 
mizing the posterior probability, that is, the maximum posterior probability estima- 
tion method (Maximum a Posteriori, MAP), i.e. 


OMAP = are mae ply, A) (2.6) 


If the prior probability p(0|A) is a constant, then the maximum posterior estimate is 
equivalent to the Maximum Likelihood (ML), that is, 


Out = arg max p{y|9} (2.7) 
0 


Equations 2.6 and 2.7 are both point estimation methods for model parameters, which 
will widely used to the later chapters. 


2.3.3 Conjugate Prior 


As we all know, the posterior distribution and the prior distribution belong to the 
same type in Bayes’ rule such that the prior distribution and the posterior distri- 
bution are called conjugate distributions. Where the prior distribution is called the 
conjugate prior of the likelihood function. Therefore, the used of conjugate priors is 
often motivated by practical considerations that the parameters @ can’t be directly 
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learned from the observations. Namely, the conjugate priors allow for introducing 
a computationally tractable mechanism for incorporating new observations into the 
posterior distribution of the parameters 6. With the Bayesian framework, we are 
interested in incorporating a prior distribution on the latent model parameter 0 for 
making predication about the new observations. To this end, assuming the associated 
conditional exist, and given N i.i.d. observations, this predictive likelihood is written 
by 


Polina) = f P(y|¥a)p(Ply1,.-., yy, Add (2.8) 
(2) 


To simplify the computational complexity, we only discusses conjugate priors 
that require the use of three similarity functions in subsequent chapters. In particu- 
lar, three common observation model are considered in the hidden Markov model, 
including Multinomial Distribution, Multivariate Gaussian Distribution as well as 
Linear Gaussian Regression Model. 


e Multinomial Observations 


Multinomial Distribution [130, 131] considers a random variable y on a finite 
sample space ~ = {1,..., K}, its probability mass function can be denoted by 
mw = [7 ,..., g], and describes the probability of a string of N observations of y 
taking on values y;,..., Yn by 


N! A 
POs Yoe wl) = a | [a Ne =D On (2.9) 
Ik ke k n 


Where, the notation 6(j, k) indicate the discrete Kronecker delta. When K = 2, this 
distribution is referred to as the binomial distribution. 

The Dirichlet prior distribution is used for formulating the conjugate prior of the 
multinomial distribution. A K-dimentional Dirichlet distribution is the conjugate 
prior for the class of K -dimensional multinominal distribution and is uniquely defined 
by a set of hyperparameters a = [a,..., œg], which has the following form: 


Du eae ag) =p(|a) 


=i eee IE æk— l On > 0 (2.10) 


Where, I (-) represents the standard Gamma function. When K = 2, this distribution 
is referred to as the Beta distribution, which can be denoted by Beta(a, a2). The 
initial value of the distribution in Eq. 2.10 is given by 


Qi 


X rak 


Then, refer to the conjugacy of the Dirichlet distribution, conditioned on N multi- 
nomial observations y;,..., yy, the posterior distribution on z is also Dirichlet that 


[ui] = (2.11) 
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formulated by 


P(T|Y1, Y2; -+ YN» 4) X p(|a)p(V1, Y2,---. Yn |e) x Dir(oy + Ny,..-,0K + Nx) 
(2.12) 
Where, the Ng represents the observed times of observation y, = k. Subsequently, 
using the normalizing constant of the Dirichlet distribution, and substituting to 
Eq. 2.8, one can derive the predictive likelihood to be 


Nk + a 
K 

N+ >> ay 
k=1 


Ply =kly,..., YN, @) = (2.13) 


e Multivariate Gaussian Observations 


Multivariate Gaussian distribution is parameterized by a mean vetor jz and covari- 
ance matrix X, which provides a useful description of continuous-valued random 
variables that concentrate about a given value and have constrained variability [130]. 
This distribution over a sample space y € Y = R® that each observation y is a 
D-dimensional vector can be derived by 


N O; u, X) = plu, X) 


x l T= y= 
P\=50 =) Q-=u) 
(2.14) 
We conveniently denote this multivariate Gaussian distribution by “~ (u, X). Three 
categories of priors about this distribution in terms of the parameters are considered 


in this book, including known covariance, known mean, and both the mean and 
covariance are uncertain. 


1 
~ mP Aa |N2° 


Known Covariance 
For known covariance X, we use the normal prior distribution as the conjugate prior 
on the mean vector u, and represented by “M (uo, Xo). 


Known Mean 

For known mean and only the convariance X is uncertain, the conjugate prior is 
formulated by the inverse-Wishart distribution [130]. The D-dimensional inverse- 
Wishart distribution with two parameters: covariance parameter A and v degrees of 
freedom, and given by 


v+D+1 


WAJ a 


wD D(D-1 k y4+1—i 
27 7 PC) TT ree) 


p(S|A, v) = exp| sirwaz)| (2.15) 


Where, tr(-) denotes the trace of matrix. We write this distribution by J% (v, A), 
and the first moment is given by calculating the expectation value: 


vA 
EE] = pT (2.16) 
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Both Covariance and Mean are Uncertain 

We use the normal-inverse-Whishart distribution as the conjugate prior when both 
the mean and covariance are uncertain. This distribution defines a conditionally 
normal prior on the mean u| X ~ N (0, X/k), and an inverse-Wishart distribution 
on the covariance, X ~ JW (v, A). Therefore, the joint prior for uncertain mean 
and covariance is then given by 


v+D+1 1 =! K Tai 
p(n, X'|k, 0, v, A) x |X| 7 exp [roaz )— 5 —v)y X` (u— ù) 

(2.17) 

Where, « represents the degree of trust in the mean generated by this prior distribution, 

Ò represents the mean of this prior distribution, v represents the degree of trust in 

the variance generated by this prior distribution, and A represents the mean of the 

variance X matrix. We will use the notation NIY (k, Ù, v, A) to represent the 

normal-inverse-Whishart distribution with four parameters. 


e Multivariate Linear Gaussian Regression Observations 


The normal multivariate linear regression model is one in which the observations 
y; € R? can be described as a linear combination of a set of known regressors x; € R” 
with errors accounted for by additive Gaussian noise, written as 


Yi = X10) + +++ + Xindn + ĉi, ei ~ N (0, X) (2.18) 


For considering the temporal correlation in a multimodal system, we may combine a 
set of N observation vectors into a matrix Y = [y, --- yy], the regressors also into a 


matrix X = [xı -- - xy], and the noise terms into E = [e; --- ey] and formulated by 
Y=AX+E (2.19) 
where the notation A = [a] --- ay] is referred to as the design matrix. 


Assuming the noise covariance X is know, the conjugate prior on the design matrix 
A is the matrix-normal distribution [132]. A matrix A € R¢*” has a matrix-normal 
distribution “WV (A; M, V, K) that given by 


d 
(A) = JAI? ter(a-My")V-"(A-M)*) (2.20) 


~ (2nV|? 


where, the M is the mean matrix, and V and K7! are related to the covariance along 
the rows and the columns of A. 

Along with Eqs.2.15 and 2.20, the conjugate prior on the set of parameters A 
and X is the matrix-normal inverse- Wishart prior that places a conditionally matrix- 
normal prior on matrix A given X, that is 


ee (2.21) 


AJE ~ WN (A: M,Z, K) 
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2.4 Hidden Markov Model 


HMM has been a workhorse in pattern recognition able to encode probabilistic state- 
space model. The HMM is a stochastic process where a finite number of latent states 
have Markovian state transitions. Conditioned on the mode sequence, the model 
assumes (discrete or continuous) conditionally independent observations given the 
latent state. Let z; denote the latent state of the Markov chain at time t € T, and 
xj the state-specific transition distribution for state j € K. Given the state z;, the 
observation y; is conditionally independent of observations and states at other time 
steps. Then, the Markovian structure on the state sequence and the observation are 
simply described as: 


PCr 2-1) = Az, POr lz) = FO) t> 1. (2.22) 


where, the state at the first time step is distributed according to an initial transition 
distributions 7; F (-) represents a family of distributions (e.g., the multinomial for 
discrete data, or the multivariate Gaussian for real-vector-valued data); 0,, are the 
state-specific emission parameters and the z).7 is a state path over the hidden state 
space. Therefore, the resulting joint density for T observations is given by: 


T 


P(Zu:7s Yr) = Tozi) POr | z1) I] PLi | 21-1) POX | ze). (2.23) 
t=2 


Then, to calculate the log-likelihood value of an observation vector y,.7 = (y1, y2, 


y3,.--, yr), we first write the joint distribution with the Markov property 
POr|¥-15 -- +s Y1) = POr|Y-1) and derive the joint probability of observation y1:r 
as: 


T 
POLT) = > POY 1r-1) 


t=1 
K 


POY) = >) PE = kly) POr) 
k=1 


(2.24) 


As illustrated in Eq. 2.23, if the state path z).7 is estimated, the maximum probability 
of the observation sequence can be obtained. However, the true state-path is hidden 
from the observations. There are different approaches to compute it: (i) the maximum 
likelihood state at any given moment; however, this approach does not consider 
uncertainty; (ii) a marginal probabilistic representation over hidden states at each 
time step. Here, the log-probability at time ¢ is derived by computing the natural 
logarithm of the sum of exponentials over all hidden states: 


= aœ (k)- B,(k) 
p ot: (k): Bi (k) 2.25 
og p(yı:r) = log Le POLT) ) K 
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where œ; (k) = p(y1-1, Zt), Br(K) = pOr4+1-7 |Z1), represent the forward message pass- 
ing and backward message respectively in the standard forward-backward algorithm 
[37, 47]. 
PE = klye—1) = >) pele = DP Eily) 
J 
a(k) = par = k| Yor) (2.26) 


= ——~ p (yiz = k) pee = k|Yi-1). 
PO+lY11-1) 


The denominator represents the probability of a sequence of observations that can 
be calculated by Eq. 2.24. 


2.4.1 Forward-Backward Algorithm 


To learn unknown parameters, the forward-backward algorithm [48] provides an effi- 
cient message passing procedure on computing node marginals of interest problems 
including filtering p(Zn|y¥1,---, Yn), prediction p(Zn4m\|¥1,---, Yn), and smoothing 
P(Zn|¥1,---, YN), N > n, where, zn denotes the hidden state and y, represents the 
observation. According to the implementation of the belief propagation algorithm in 
Bayesian graphical model, we define a set of forward and backward messages by 


forward © Gilt) Pi, -- -s Yn, Zn) 


(2.27) 
backward : Bn(Zn) = P(n41, sy yn |Zn) 


To address the problem of filtering, we formulate the filtering with the forward 
messages defined in Eq. 2.27. The filtering is simply generated as 


On (Zn) 
Xan (z) 


filtering: D(Zn|ly1,--+; Yn) = (2.28) 


To address the problem of prediction, we can utilize the Markov structure for the 
underlying chain to derive, that is 


Yenin- Pum |Zn+m-1) MER Xo, P(Zn+1 |Zn)Qn (Zn) 
Xæ (Z) 


P(ZntmlY1,-+++ Yn) = (2.29) 


From Eq. 2.29, it shows that the prediction is equivalent to propagating the forward 
message without incorporating the missing observations y,11,..-, Yn+m- However, 
we utilize both of the forward and backward messages for addressing the problem 
of smoothing for any m by 
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POI +++ YwlZn) Pn) 
POI -- -> YN) 
— PY ++ +s YnalZn) Pnt, -- -> YN |n) PKn) 
= POL- Yn) 
= On (Zn) Bn (Zn) 
~~ EAn) Bm) 


P(Znly1,---, YN) = 


(2.30) 


Hence, we can referred to the belief propagation algorithm for addressing the 
problems of filtering, predication, and smoothing of HMM. Among of them, we can 
use the forward messages for calculating the marginal likelihood of observations. 


2.4.2 Viterbi Algorithm 


The Viterbi algorithm is proposed for addressing the problem of the most likely state 
sequence to have generated an observation sequence y;,..., yy when given a set of 
HMM parameters, which provides an efficient dynamic programming approach to 
computing this Maximum A Posteriori (MAP) hidden state sequence by 


N 
2= max n’a) pile) | | PCan) Palen) 
Zs ZN n=? 
(2.31) 


iring 


N 
= min tora) = lag) + loner) = tesponiz» 
Rn 


n=1 


From the above, we can note that choosing the MAP sequence is not necessar- 
ily equivalent to choosing the maximum marginal independently at each node by 
Zn = maxp(Znly1,---, yy). The marginal sequence also may not even be a feasible 
sequence for the HMM. 

The Viterbi algorithm can be used to compute the most probable sequence of 
states in HMM based on the dynamic programming principle that the minimum cost 
path to Zn = k is equivalent to the minimum cost path to node z,_; plus the cost of 
a transition from z,_, to Zn = k. Therefore, the MAP hidden Markov model state 
sequence Z,...,Zy can be computed in the following four steps that originally 
presented in literature [45]. Here, we first initialize the minimum path sum to state 
zı = k for each k € {,..., K} by 


n=1: A =k) = logn’ (z = k) — logpOile = k) (2.32) 


Forn = 2,..., N, and for each k € {1,..., K}, calculate the minimum path sum th 
state Zn = k by 
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n = {2,...,N}: Fan =k) = — logp(yalén = k) + mint Sn- n-1) 
Zn-1 (2.3 
— logp(zn = k|zn-1)} 


and simplifies to 


Zin) = arg mint n-i (Zn-1) — logp(Zn = k|Zn—-1)} (2.34) 


Zn-1 


thus, computing the probability of hidden state sequence given observation sequence 
by 
min — log p(z,.--,Zn|¥1,---, YN) = SN (ZN) (2.35) 


Zils ZN ZN 


and then set the notation ĉy = arg min Zy (zy). Finally, the most likely hidden state 
ZN 
sequence given observations can be iteratively calculated by 


Bn = Sny), neE{N-1,..., 1} (2.36) 


We will propose methods that have straightforward connections with the Viterbi 
algorithm for anomaly monitoring during robot manipulation in Chap. 4. 


2.5 Nonparametric Bayesian Hidden Markov Model 


Bayesian methods are widely applied to various fields such as biomedicine, clin- 
ical trials, social science, and economics. The collected data are more and more 
complicated with the rapid development of modern science and technology. These 
complicated data include high/ultrahigh dimensional data, big data, discrete and con- 
tinuous data. The aim of this section is to introduce the newly developed Bayesian 
methods, including Bayesian variable selection (e.g., fixed-dimensional data analy- 
sis and high/ultrahigh dimensional data analysis), Bayesian influence analysis (e.g., 
case deletion method and local influence analysis), Bayesian estimation and cluster- 
ing methods (e.g., fixed dimensional data, high/ultrahigh dimensional data analysis, 
Bayesian network, and Bayesian clustering for big data), Bayesian hypothesis test 
including discrete and continuous random variables, variational Bayesian analysis 
and Bayesian clinical trials including design and dose-finding algorithm, and their 
applications in various fields. 

Moreover, Bayesian nonparametric methods avoid the often restrictive assump- 
tions of parametric models by defining distributions on function spaces, especially, 
for the prior function in Bayesian formulation. If effectively designed, these meth- 
ods allow for data-driven posterior inference. The overview of Bayesian nonpara- 
metric inference in recent decades, see [49-51], and we briefly describe two aspects 
of Bayesian nonparametric methods on the hidden Markov model, including the 


2.5 Nonparametric Bayesian Hidden Markov Model 23 


Dirichlet process and its hierarchical extension, such that formulated the Hierarchi- 
cal Dirichlet Process Hidden Markov Model (HDP-HMM) for modeling multimodal 
observations in robotics. 


2.5.1 Bayesian Nonparametric Priors 


HMMs restrictively assumes a fixed model complexity. In process monitoring, latent 
states can represent robot primitives. It’s clear that not all robot skills have the same 
number of primitives and that even the same skill might have variation when con- 
ducted under different conditions. To introduce flexibility into the number of com- 
puted hidden states, we leverage priors as probability measures. 

HMMs can also be represented though a set of transition probability measures 
G j. Probability measures yield strictly positive probabilities and sum to 1. Consider 
if instead of using a transition distribution on latent states, we use it across emission 
parameters 0; € ©. Then, 


K 
Gj = È ajra (2.37) 


where dg, is the unit mass for mode k at 6. The emission parameter 0 ;, can be evaluated 
at time index t — 1, such that: 


POG) = Gja POrlO) = FO) (2.38) 


So, given 6;, different probability weights are assigned to possible successor can- 
didates 6%. We can also assign a prior to the categorical probability measure G j. 
The Dirichlet distribution is a natural selection due to conjugacy. Thus, the transi- 
tion probabilities m; = [71 ---2;x] are independent draws from a K -dimensional 
Dirichlet distribution: 


p) = Dir(ay,...,@xK), j=1,..., K 
K 

Yom =1 

k 


Additionally, emission parameters are drawn from a base measure H such that, 
p(0;) = H. The Dirichlet process (DP) was used as the base measured instead of 
the Dirichlet distribution. The DP is a distribution over countably infinite probability 
measures Go : © > R*+, where Go(O) >= 0 and fo Go(0)d0 = 1. The DP has a 
joint Dirichlet distribution: 


(2.39) 


P(Go(0;), ..., Go(@x)) = Dir(y H(0;),..., yH(Ox)). (2.40) 


We can further summarize the probability measure as p(Go) = DP (y, H) as: 
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Go = > Breda, p(%) =H, (2.41) 
k-1 


where y is the concentration parameter and H is the base measure over parameter 
space ©. The weights 6; are sampled via a stick-breaking construction: 


Bk = Tz — v), (2.42) 


where p(vk) = (1, y). For succinctness, the stick-breaking process is defined as: 
p(B) = GEM (y). The DP is used to define a prior on the set of HMM transition 
probability measures G j. However, if each transition measure G ; is an independent 
draw from DP (y, H), where H is continuous, like a Gaussian distribution, transition 
measures lead to non-overlapping support. This means that previously seen modes 
(robotic primitives) cannot be selected again. To deal with this limitation, a Hier- 
archical Dirichlet Process (HDP) is used. The latter constructs transition measures 
G; on the same support points (91, 62, ...0x). This can be done when G; is only a 
variation on a global discrete measure Go, such that: 


Go =) Bð p(Bly) = GEM(y) 
k-1 


P(% | Go) = Go (2.43) 


Gj =J jda pOT |e, B) = DP(a, P). 
k-1 


This HDP is used as a prior on the HMM. The implications are a mode complexity 
learned from the data and a sparse state representation. The DP(q, 6) distribu- 
tion encourages modes with similar transition distributions, but does not distinguish 
between self- and and cross-mode transitions. This is problematic for dynamical 
data. The HDP-HMM yields large posterior probabilities for mode sequences with 
unrealistically fast dynamics. Emily B. Fox et al. [52] introduced the sticky parameter 
into the HDP, yielding the sHDP-HMM, thus the transition probability zr; is defined 


as: 
aß + =) 


2.44 
PER (2.44) 


pCt; |a,«, B) = DP (+e, 


The sticky HDP increases the expected probability of self-transitions by an amount 
proportional to « and leads to posterior distributions with smoothly varying dynam- 
ics. Finally, priors are placed on the concentration parameters œ and y, and the sticky 
parameter K. Latent state creation is influenced by «œ and y, while self-transition prob- 
abilities are biased by «. These priors allow to integrate expert knowledge for better 
modeling than the expectation-maximization (EM) algorithm traditionally used in 
HMMs. 
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2.5.2 The sHDP-VAR-HMM 


Many complex dynamical phenomena are not adequately described as conditionally 
independent of the observations given the state z;. That is, the observations are a 
noisy linear combination of some finite set of past observations plus additive white 
noise. In such cases, the dynamical processes can be modeled by an autoregressive 
(AR) process. Particularly, an order r vector AR process, denoted by VAR(r), with 
observations y; € Rf. The VAR(r) system can be considered an extension of the 
HMM; instead of having conditionally independent observations given the hidden 
state sequence, the system has conditionally linear dynamics. Figure4.9 shows a 
graphical model representation and compares it to the sHDP-HMM described in 
previous section. 

The VAR has simplifying assumptions that make it a practical choice in appli- 
cations [52]. The switching regime can be combined with the sHDP-HMM from 
Sect. 2.5.1 to leverage the expressiveness of the VAR system with the ability of non- 
parametric priors to learn the mode complexity of the model. The VAR(r) process, 
with autoregressive order r consists of a latent state z; € R” with linear dynamics 
observed through y; € R¢. The observations have mode specific coefficients and 
process noise as: 


PCZ) =T t>1 
ar (2.45) 
y= > A Yi Fez), pler) = N (0, X). 


i=1 


We see a generative model for a time-series {y1, y2,..., yr} of observed multimodal 
data, a matrix of regression coefficients A® = pa” --- A®] e R], anda mea- 
surement noise X, with a symmetric positive-definite covariance matrix. Given the 
observation data, we are interested in learning the “r‘ A” model order, for which we 
need to infer {A“, ©}. We leverage the Bayesian approach through the place- 
ment of conjugate priors on both parameters for posterior inference. As the mean 
and covariance are uncertain, the Matrix-Normal Inverse-Wishart (MNIW) serves 
as an appropriate prior on the multivariate AR distribution. The MNIW places a 
conditionally matrix-normal prior on A® given X: 


p(A” | X) =. MN (A; M, X, K), (2.46) 

and an inverse-Wishart prior on X: 
p(X) = IW (vo, So). (2.47) 
See [45] Appendix F.1 for a detailed derivation of resulting posterior and definitions 


for variables M and K which are parameters that define the matrix-normal portion 
of the MNIW prior. 
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2.6 Summary 


In this chapter, a Bayesian nonparametric approach in defining a prior on the hid- 
den Markov model that allows for flexibility in addressing the problem of modeling 
the complex dynamics during robot manipulation task is proposed. considering the 
underlying dynamics that can be well-modeled as a hidden discrete Markov process, 
but in which there is uncertainty about the cardinality of the state space. We use the 
hierarchical Dirichlet process (HDP) for examining an HMM with an unbounded 
number of possible states. Subsequently, the sticky HDP-HMM is investigated for 
allowing more robust learning of the complex dynamics through a learned bias by 
increasing the probability of self-transitions. Although the HDP-HMM and its sticky 
extension are very flexible time series models, they make a strong Markovian assump- 
tion that observations are conditionally independent given the discrete HMM state. 
This assumption is often insufficient for capturing the temporal dependencies of the 
observations in real data. Consequently, an extension of the sticky HDP-HMM for 
learning the switching dynamical processes with switching linear dynamical system 
is investigated. As the basis, the detailed formulations and theoretical process of 
those models are explained step-by-step in each sections, and widely used in the 
later chapters of this book. 
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Chapter 3 A) 
Incremental Learning Robot Task geai 
Representation and Identification 


Abstract In this chapter, we present a novel method for incremental learning robot 
complex task representation, identifying repeated skills, and generalizing to new 
environment by heuristically segmenting the unstructured demonstrations into move- 
ment primitives that modelled with a dynamical system. The proposed method com- 
bines the advantages of recent task representation methods for learning from demon- 
stration in into an integrated framework. In particular, we use the combination of finite 
state machine and dynamical movement primitives for complex task representation, 
and investigate the Bayesian nonparametric hidden Markov model for repeated skill 
identification. To this end, a robot should be able to identify its actions not only 
when failure or novelty occurs, but also as it executes any number of skills, which 
helps a robot understand what it is doing at all times. Two complex, multi-step robot 
tasks are designed to evaluate the feasibility and effectiveness of proposed methods. 
We not only present the results in task representation, but also analyzing the perfor- 
mance of skill identification by various nonparametric models with various modality 
combinations. 


3.1 Introduction 


As we all know, a sentence in a language is composed of words according to gram- 
matical rules, and a word is composed of letters according to word formation rules. 
Then, a robot complex manipulation task corresponds to a sentence, the movement 
primitives of the robot coupling correspond to a word, and the movement primitives 
of the robot’s respective degrees can be expressed as letters. Therefore, the robot task 
can be represented by a set of learned movement primitives. Through the description 
of the movement primitives, the movement of the robot can be generalized, and the 
diversity and adaptability of the manipulation tasks can be improved. Based on the 
representation of the movement primitives of the robot tasks, the identification of the 
robot’s movement will facilitate the abnormal monitoring and recovery of the robot 
system. 
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3.2 Related Work 


In recent years, as robots have been widely used in fields such as human interac- 
tion, home services, medical care, and industrial production, to enable non-robot 
professionals and people who lack specialized knowledge to quickly complete robot 
movement programming. Compared to the traditional robot control scheme [18, 23], 
robots Learning from human demonstrations has become mainstream for complex 
task descriptions [1, 3]. Generally, in order to effectively learn and generalize the 
robot’s manipulation tasks, the tasks are often divided into multiple segments of 
movement primitives for learning [4, 6], and thereafter the tasks are described by 
stitching or serializing different movement primitives [7—9]. Therefore, the learning 
methods of robot movement primitives mainly include the following four categories 
[12, 13]: Probabilistic movement primitives, dynamic movement primitives (DMPs), 
and probabilistic movements based on a Guassian Mixture Model (GMM). Proba- 
bilistic Movement Primitive (ProMP) and Stable Estimator of Dynamical System 
(SEDS). 

In particular, GMM has strong noise processing ability and strong robustness for 
human unstructured demonstration trajectories, and it is easy to deal with the problem 
of high-dimensional trajectories. The key implementation process is to first use the 
multivariate Gaussian distribution to model the demonstration trajectory, then use 
the Expectation-maximization (EM) algorithm and nonlinear regression technology 
to learn the model, and finally adjust the input new task configuration to achieve a 
generalized description of the trajectory, as DA Duque et al. presented in [11]. 

This method is embodied in that an arbitrary continuous probability distribu- 
tion can be approximated by a weighted average of a mixture of multiple Gaus- 
sian distributions, and has been widely used in robot motion primitive learning and 
motion generation [14]. Elena et al. [15] learned the trajectories of human demonstra- 
tions through GMM, and estimated the stability of the non-linear multi-dimensional 
dynamic system on the generated trajectories, so that they can not only generate 
trajectories for unknown tasks but also perform online in the presence of external 
disturbances. Adjustment Calinon et al. [16, 17, 19] proposed a trajectory learning 
framework based on variance in the task space, modeled human trajectories through 
GMM, and learned EM algorithms for model unknown parameters. Finally, Gaussian 
Mixture regression model (Gaussian Mixture Regression (GMR) and optimization 
estimator. Peter et al. [20] used the Gaussian Process (GP) method to establish a 
regression model on the trajectory of human demonstration to express motion and 
used KLback-—Leibler Divergence (KL) as an indicator of trajectory generalization 
performance. This learning through probability distribution guarantees the stability 
of motion generation. 

Additionally, Schaal and Ijspeert et al. [13, 22, 24] to study the demonstration 
movement with a dynamic system of nonlinear differential equations. The most 
advanced advantage of this method is that only one demonstration trajectory can 
be used to complete the learning and generate the Adapted movement. At the same 
time, DMP inherits many advantages of dynamic systems, such as conditional con- 
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vergence, robustness to external disturbances, and time independence. Khansari et 
al. [21, 25, 26] aimed at the problem that the motion primitives learned in the GMM 
method could not guarantee its motion stability, and proposed a Stable Estimator of 
Dynamical Systems (SEDS) for nonlinear dynamic systems. The motion primitive 
learning method combines a dynamic system with a probabilistic statistical model 
to obtain its globally stable constraints. Finally, the parameter estimation problem is 
transformed into an optimization problem. 

The robot’s manipulation tasks can be divided into multiple parameterized motion 
primitive, so that the combination and serialization of multiple different motion prim- 
itives can effectively improve the diversity and adaptability of robot tasks. Therefore, 
how to correctly select the next motion primitive after the execution of the current 
motion primitive is another research difficulty of the complex task description of the 
robot. In response to the serialization of motion primitives, Pastor et al. [27] pro- 
posed a motion description based on the Associative Skill Memories (ASM) of DMP 
motion primitives for motion selection, which differ from the control scheme [31]. 
Statistical data such as mean and variance are saved at all times. After completing 
90% of the current motion primitives, use the remaining 10% of the sensing data 
and the first 10% of the data of the candidate primitives to calculate the distance of 
several miles, and the nearest one is executed as the next step. 

Niekum etal. [28, 29] proposed the serialization of robot manipulation tasks by the 
finite state machine method, and described the target points of each motion primitive 
in different coordinate systems, and then described by different coordinate systems. 
The classifier constituted by the target, and finally selects the motion primitive to be 
executed next in the way of classification. Manschitz et al. [4, 9, 30, 32] proposed a 
method for learning task manipulation graphs from Kinesthetic Teaching, and learned 
the conversion probability between various motion primitives from the experience 
of multiple teachings. The method is successfully applied to the task of unscrewing 
a light bulb by a multi-step robot. 

Zhe Suet al. [33] introduced and evaluated a framework to autonomously construct 
manipulation graphs from manipulation demonstrations. Our manipulation graphs 
include sequences of motor primitives for performing a manipulation task as well 
as corresponding con-tact state information. The sensory models for the contact 
states allow the robot to verify the goal of each motor primitive as well as detect 
erroneous contact changes. The proposed framework was experimentally evaluated 
on grasping, unscrewing, and insertion tasks on a Barrett arm and hand equipped 
with two BioTacs. 

In summary, due to the teaching programming and offline programming methods 
of traditional industrial robots can no longer satisfy the representation of multi- 
step and complex robot manipulation tasks. In this chapter, through comparison and 
analysis, methods and theories for describing complex tasks of robots have been 
developed. The research combined the methods of dynamic movement primitives 
(DMP) and finite state machines (FSM) to achieve task representation and serializa- 
tion selection among different movement primitives, an simple pick-and-place task 
was set in [34]. 
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3.3 Graphical Representation of Robot Complex Task 


In the case of unstructured and dynamic environments, the task of the robot is difficult 
to complete from beginning to end with fixed and pre-programmed movements, 
because the uncertainty of the environment and the unpredictability of the state of the 
robot in actual manipulation will cause the task The failure or exception occurred. 
For example, during the movement of the Baxter robot during the pick-and-place 
task, the robot will not be able to adjust the movement to adapt to changes in the 
environment due to changes in the pose of the object. Based on this problem, this 
paper proposes a way to artificially segment a complex task of a robot by describing 
multiple movement primitives, and uses a finite state machine mechanism to construct 
the task to implement a directed graph and motion primitive description of the task. 

A finite state machine is a mathematical model that represents a finite number of 
states and the transition between these states. It is usually represented by a five-tuple: 
M = (Q, qo, >., ô, F), where Q is a non-empty finite set of states; q; € Q represents 
a state; qo is the initial state; X` is the input condition Instruction; ô is a transfer 
function between states. Each state in the state machine stores the related information 
of the past, reflecting the changes before and after the input of the state machine; 
and the state transition indicates that the current state has changed. State transition 
rules are used to describe the necessary conditions for state transition. Generally, 
a directed state transition diagram can be used to describe the working condition 
of a finite state machine, which can clearly indicate the transition relationship and 
transition conditions between different states. As shown in Fig. 3.1, a system includes 
a finite state machine with four states, where each node represents a state g;, and 


Fig. 3.1 Illustrates the graphical representation of FSM state transition 
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each edge represents a transition relationship between states. If the input of state qo 
is a, the movement proceeds to state q1, if the input is b, the movement proceeds to 
state q2, and the other states transition The relationship is the same. The final node 
q3 represents the end state of the system. 

According to the understanding of the finite state machine, the robot’s execution 
task is represented by a multi-segment movement primitive, and its primitive is 
derived from the human’s definition and understanding of the task. Generally, the 
following five factors need to be considered when segmenting tasks: 


(1) Each movement behavior should maintain independent and complete movement 
manipulation; 

(2) The smooth motion trajectory of the robot should be maintained in each move- 
ment behavior; 

(3) The stability of modal information should be maintained in each movement 
behavior; 

(4) A certain length of exercise should be maintained in each exercise behavior; 

(5) If the requirements are met, minimize the number of sports activities. 


Based on the above considerations, the finite state machine description of the 
autonomous pick-and-place task of the Baxter robot in Sect.4.5.3 is implemented, 
as shown in Fig. 3.2. Finally, in order to improve the adaptability and generalization 
ability of operating tasks in an unstructured dynamic environment, this paper uses the 
model learning method to model motion primitives between states on the premise of 
finite state machine description of tasks. In order to be able to adapt to the needs of the 
task and changes in the environment. In particular, in this paper, the motion primitive 
learning method DMP based on dynamic systems is used to model the movement 
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Fig. 3.2 Illustrates the FSM implementation of Baxter robot performing pick-and-place task 
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primitive. Thanks to this method, it only needs to learn from a human demonstra- 
tion trajectory and good generalization. After learning the motion primitives, when 
a new starting amount and target amount are given to each movement behavior, a 
new movement behavior can be generated by referring to the demonstration move- 
ment, and a new manipulation task is generated. The specific parameterization and 
generalization process will be Explained in the next subsection. 


3.3.1 Learning Graphical Representation from Unstructured 
Demonstration 


e Demonstration Collection 


Generally, capturing the demonstrations by receiving the multimodal input from the 
end-user, such as a kinesthetic demonstration. The only restriction on the multimodal 
data is that all the signals must be able to be synchronously recorded at the same 
frequency, i.e. temporal alignment. Additionally, the multimodal data at each time 
step should include the Cartesian pose and velocity of the robot end-effector (in case 
of object-grasping, will along with any other relevant data about the end-effector, 
e.g. the open or closed status of the gripper and the relative distance between the 
end-effector and object.) as well as the signals from F/T sensor and tactile sensor. 
Subsequently, the recorded Cartesian pose and velocity trajectory of the end-effector 
will be referred to as the kinematic demonstration trajectory for controlling the robot 
motion, and the recorded signals of F/T sensor and tactile sensor are applied for 
learning the introspective capacities. 


e Finite State Machine 


Defining the set .@ as the union of all the unique states z € {1,2,..., K} from the 
derived hidden state sequences Z = {%, %,..., Zy,}, a Finite State Machine 
(FSM) that represents the task can designed to be constructed by creating nodes! 
Ni, N2, ..., Nm, where K < m and each element should correspond to the same 
hidden state z. For k € {1,..., m}, each node N; is assigned a set of .G of all 
exemplars that have the same hidden state and executing phase, and the starting and 
goal pose are also recorded with Ns, Ne, respectively. A m x m transition matrix 
T can then be constructed, where each element T; j is originally set to 1 if there 
exists a directed transition from N; to Nj, and O otherwise. We will implement 
the incremental task exploration by modulating this transition value as described 
in Chap. 6 in two ways. Consequently, we hopefully learn the robot complex task 
representation by formulating each state of constructed finite state machine using a 
state-specific movement primitive technique. 


‘Here, the term “node” is used to instead of the default “state” in FSM for avoiding confusing 
hidden state of HDP-VAR-HMM or state of the robot. 
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3.3.2 Robot Movement Primitive Learning 


Learning movement primitive has been commonly applied to solve different robot 
tasks in isolation. This usually requires either careful feature engineering, or a sig- 
nificant number of demonstrations. This is far from what we desire: ideally, robots 
should be able to learn from very few demonstrations of any given complex task, 
and instantly generalize to new situations of the same task, without requiring task- 
specific engineering. In this paper, we explore the Dynamical Movement Primitives 
(DMPs) for achieving such capability for equipping robot with the aforementioned 
generalization capability of movement primitive, which provides a framework for 
describing the dynamical systems as set of nonlinear differential equations. DMPs 
guarantee the stability and convergence by introducing an additional canonical sys- 
tem and also provide simple mechanisms for LfD. Particularly, a discrete movement 
DMP can be formulated by the transformation system, that is 


Tv = K(g — x) — Dv — K (e — xo)s + Kf (8), (3.1) 
TX =V. 
where, the Eq. 3.1 is an extended PD control signal with spring and damping constants 
K and D respectively, position and velocity x and v, goal g, scaling s, and temporal 
scaling factor 7. The scaling term originates from an additional system that controls 
the system’s phase execution by 
TS = —QS, (3.2) 


where, the scalar a can be an arbitrary constant. 

Additionally, the forcing term f (s) in Eq. 3.1 is used to alter attractor point dynam- 
ics and achieve an arbitrary trajectory, which commonly learned from an one-sot 
demonstration [1]. Thus, along with the Eq.3.1, the forcing term f(s) can be for- 
mulated as a phase-dependent linear combination of a set of basis functions y; (s) 
(often use the Gaussian basis function), that is, 


O Ea witils)s 
FOS hag (3.3) 
Ys) = (-hi(s — c:)*) 


where, the basis function 7)(s) is formulated with mean c; and variance h;. Then, the 
forcing item is represented the linear combination of basis functions with variable 
weights w; and normalization constant `, Y; (s). Phase s monotonically varies from 1 
to 0 and controls phase progress by activating Gaussian centered at c;. The decreasing 
phase value ensures that the contribution of the forced term disappears and returns 
to the simpler point attractor dynamics to converge to the target. Thus, the (g — x) 
term in Eq.3.1 that with respect to spatio-temporal scaling, which performs spatial 
scaling for guaranteeing the system can adjust to varying goals immediately. While 
the 7 variable in Eq.3.1 is designed for allowing the system to speed up or slow 
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Fig. 3.3 Illustrates the learning and generalization of dynamical movement primitive 


down the execution of a movement. Generally, the forcing term weights are learned 
from a kinesthetic demonstration with pose x(t), velocity x as well as acceleration 
X with duration T are extracted as in [29]. After learning movement primitive from 
One-shot demonstration, the target forcing turn of an unseen scenario is computed by 
adjusting Eq. 3.1, integrating Eq. 3.2 to convert from time to phase, and substituting 
appropriate values to yield, 


—K(g — x(x)) + Dx(s) + TX(s) 
g— xo l 


Siarget (s) = (3.4) 


Next, the goal is set to g = x(T) and 7 is selected such that a DMP reaches 95% 
convergence at t = T before using standard linear regression to compute the weights 
Wi. 

In summary, when a new starting pose and target pose are given after learning 
the motion primitives, a new movement trajectory can be generated with reference 
to the demonstration motion, thereby improving the diversity and adaptability of 
complex tasks. As shown in Fig.3.3, the DMP model is learned from a human 
demonstration trajectory (blue curve in the figure). When different starting points and 
target points are given, the corresponding motion can be generated by the learned 
DMP model. Track (gray curve in the picture). In addition, the DMP model has 
a good generalization ability, which is conducive to improving the diversity and 
adaptability of robot manipulation tasks in an unstructured dynamic environment. 
In particular, in the generalization process of the SPAIR system proposed in this 
paper, the starting position of each motion primitive is determined by the human 
demonstration movement or the external vision system. 
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3.4 Nonparametric Bayesian Method for Robot Movement 
Identification 


3.4.1 Problem Statement 


As those aforementioned related works of robot complex task representation, the task 
should be composed of multiple segments of movement primitives, and new manip- 
ulation tasks can be generated through the learning and generalization of learned 
primitives. As the planned robot manipulation tasks become more and more com- 
plex, the movement primitive library will increase accordingly, and there will be 
multiple feasible primitives candidate during the transition in the task. At this time, 
the manipulation task described by the finite state machine will select the primitive 
with the highest probability value for execution according to the current observation 
value. By real-time monitoring of the robot’s current behavior, that is, the identifi- 
cation that the robot is in or subsequently selecting that primitive, will help estimate 
the current execution status of the robot, and provide a basis for subsequent chapters 
for robot anomaly monitoring and recovery. 


3.4.2 Gradient of Log-Likelihood for Movement 
Identification 


Assume that N normal multimodal trajectories of each movement are collected for 
learning the common dynamics among them via kinesthetic demonstration or move- 
ment generation. To jointly model those trajectories using the Bayesian nonparamet- 
ric hidden Markov model, particularly, the HDP-VAR-HMM (see details in Chap. 2) 
with linear Gaussian observation model is investigated, where each trajectory n € N 
is represented as a multivariate time series that including multidimensional obser- 
vation data y, = [y1, y2,..-, yr] at time t. Take a robot common manipulation task 
as example, y; € R? could be an instant of wrench force, torque, and end effector 
velocity, where d is the total dimensionality of observed features. In HDP-VAR- 
HMM, each observation y; is interpreted by a positive integer, named hidden state 
Z = k € {1,2,3,...}. A countably infinite set of consecutive integers z; that should 
be satisfied with the requirements that an initial probabilistic state distributions 7o 
and a infinite transition distribution {7}7°, are generated using Markovian method. 
Consequently, the state sequence for allt > 1 with Markovian structure can be intu- 
itively formulated as 

Zt Tz 


Zt-1? 


pi =k) = T0, (3.5) 
Plr = k |z = k) = Ty 
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According to Eq. 3.5, the hidden state z,_; indexes all observations y, are assigned 
with z;_; at time step t. Since that, the observation y; given specific hidden state 
Z, = k can be drew from its corresponding observation likelihood functions. 

Since we assign the emission model of HDP-VAR-HMM as the first-order auto- 
regressive Gaussian likelihood. As such, each observation y, are considered to be 
a noisy linear combination of the previous observation y,, plus additive Gaussian 
white noise e€ and can be expressed as 


Yr = Ak Yi—1 + €z = k), elk) ~ N (0, Xr). (3.6) 


Since the € is assumed with zero mean, each state k should described with two 
dynamic parameters {A,;, Xk}, where, Ag is a d x d matrix of regression coeffi- 
cients that defines the expected value of each successive observation as E[y,|y,_ 1] = 
Axy;-1, and Xz isad x d symmetric positive definite matrix that defines the covari- 
ance matrix of state k. Then, the problem turns into how to calculate the VAR obser- 
vation likelihood? 

Therefore, the Gaussian regression observation model explains a observed data 
pairs of input y; and y,_;, and output z,, which is different from the HMM case. 
Resulting that each input is a vector of length y € R, while each output is a scalar. 
With this dimension reduction, the task segmentation by grouping the unique hidden 
states. In particular, we focus on a generative model for learning the output data 
that depends on y,. As presented in [5], there are various generative models can 
be considered, such as full-covariance Gaussian, diagonal-covariance Gaussian, are 
possible for the observed input data. Consequently, the VAR observation likelihood 
can be directly define by the multivariate Gaussian log-likelihood function at specific 
state Zk, 


log p(yr\Ox, Yr-1) = log pil Ak, Xk, Yr-1) 
= log N (Yil AkYt-1, Xk) 


d 1 
= ——log (2r) — -log| X;|— (3.7) 
5 og (27) 5 og| X| 
1 T 5-1 
z0 — Axy-1) X; (Y: — AkYyr-1). 
We define the parameter space © = {01, 02, . . . , Og} for all the hidden states, where 


Ok = {Ax, Xr} for denoting the parameters of k-th state. Where 0, = represents the 
observation parameters for the trained HMM and the zı:r is a state path over the 
hidden state space. As such, if the state path is estimated, the maximum probability 
of the observation sequence can be obtained. Therefore, given M trained models for 
S skills in the robot manipulation task (M is equal to S when using k-fold cross 
validation for optimal model selection), the optimal model of a skill is used along 
with the standard forward-backward algorithm to compute the expected cumulative 
likelihood of a sequence of observations. 

However, since the true state path is hidden from the observations. For the compu- 
tational convenience, the general approach would be to use the maximum likelihood 
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state at any given moment, but this would neglect uncertainty. Instead, we use a 
marginal probabilistic representation over hidden states at each time step. Thus, 
the log-probability at time ¢ is derived by computing the logarithm of the sum of 
exponentials over all hidden states 


ay, (k)- ne) 
L=l f 3.8 
ag Yow ( pOor) oe 


where a(k) = p(z: = kl yo), (kK) = pOr+1:7r|Z1), are presented the forward mes- 
sage passing and backward message in the standard forward-backward algorithm, 
respectively. 


1 
ay (k) = ————~ p(y |Z = k) ple = kl yt-1), 
P(yrl¥o:r-1) 


PE = klyon) = >> pelea = JP C-1lyou-1). 
J 


(3.9) 


The expected cumulative likelihood is E [log P (yı:r | ©m)] for each trained model 
m € M using Eq. 3.8, where O, represents the parameters of the trained model. That 
is, given a test trial €, the cumulative log-likelihood is computed for test trial observa- 
tions conditioned on all available trained skill model parameters log P (ye, | 0)” 
at arate of 100 Hz. The process is repeated when a new skills € Sis started. Given the 
observable position in the FSM £., we can index the correct log-likelihood I(&, € s) 
and see if the probability density of the test trial given the correct model is greater 
than the rest: 


log P(yE,:¢. | Os) > log P (yet. | Om) 
Va(meMAms#s). (3.10) 


If so, the skill identification is deemed correct and we record the time at which the 
correct classification was flagged. We compute both a classification accuracy matrix 
and the mean time threshold value by cross-validation period. 


3.5 Experiments and Results 


Results for nominal and anomalous skill identification are reported independently 
for clarity. In the case of anomaly monitoring, the trials within the test folds for 
skill identification were randomly intermixed. A 12-dimensional observation vec- 
tor consisting of 6D wrench measurements and 6D pose values at each time step 
t: yt = (fx, fys fe, Mx, My, Mz, X, Y, Z, r, p, y). All the parameters of the nonpara- 
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Fig. 3.4 Experiment 1: the skills of a traditional pick-and-place task by FSM: i Pre-pick, ii Pick, 
iii Pre-place, and iv Place 


metric methods presented in this paper are set to the same values as those found in 
[5], when used for speech recognition classification. 

An autoregressive order r = 2 is used. The parameter mean matrices M and K 
are set such that the mass of the matrix-normal distribution is centered around stable 
dynamic matrices while allowing variability in the matrix values. We start by assum- 
ing the mean matrix M = 0 and setting K = 10 x Im. The inverse-Wishart portion of 
the prior is given by 4% = m + 2 DoF (the smallest integer setting to maintain a proper 
prior). The scale matrix Sp = 0.75 is the empirical covariance of the data set. Also, 
setting the prior from the data can move the distribution mass to reasonable parame- 
ter space values. A Gamma(a, b) prior is set on the sHDP concentration parameters 
a + k and y. A Beta(c, d) prior is set on the self-transition parameter p. We choose 
a weekly informative setting and choose: a = 1, b = 0.01, c = 10, d = 1. Finally, 
the Gibbs sampling parameters truncation level and maximum iterations are set to 
20 and 500 respectively. 

In order to assess the robot introspection ability of the sHDP-VAR(r)-HMM, two 
real-robot platforms were tested: (i) a Baxter robot for a pick-and-place task, as shown 
in Fig. 3.4 and (ii) a HIRO-NX robot for the snap assembly task, as shown in Fig. 3.9. 
Both robots used F/T sensors (Robotiq FT300 and a JR3 respectively) that were 
placed on the robot wrists to measure forces and moments at 100 Hz. End-effector 
pose signals were computed using forward kinematics. Our workstation consisted 
of an 8-core Intel Xeon processor, 16GB RAM, running Linux Ubuntu 14.04 and 
ROS-Indigo. 


3.5.1 Experiment 1: Baxter Robot Performs Pick-and-Place 
Task 


The dual-arm Baxter robot is used for the pick-and-place task. The latter is boot- 
strapped via a finite state machine composed of four skills (see Fig.3.4) and the 
goal is to use the sensed wrench and pose signals in building a statistical model for 
each skill through the proposed sHDP-VAR(r)-HMM method. Figure 3.5 shows an 
example of the wrench signals for each skill (shown in different colors) of this task. 
The signals were segmented into four skills according to the states shown in the 
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Fig. 3.5 Experiment 1: recorded wrench signals in the pick-and-place task. Different colors encode 
different skills 


finite state machine in Fig. 3.4. Notice that as the task is being executed, significant 
patterns emerge within and in-between skills. For instance, if we examine the signals 
for Go-to-Pick-Hover and Go-to-Pick shown in Fig.3.4, we can identify patterns 
that can be encoded with statistical significance using our model. To evaluate the 
performance of the sHDP-VAR(r)-HMM models, we tested 24 samples for each 
skill through leave-one-out cross validation [35]. The optimal model for each skill 
was selected via maximum likelihood estimation. 

For nominal skill identification, we test normally executing skills. Figure 3.6 
shows corresponding signals generated on the basis that for ith skill model output is 
1 if its cumulative likelihood is greater than the rest of the models at each time step, 
else 0. If a model has a maximum output cumulative likelihood value during the last 
time step in a sample, then the value is set to 1, implying this observation belongs 
to that model class (the correct skill identification); otherwise, the firing would be 
0. From the skill intervals shown in Fig.3.5 and the models outputs of in Fig. 3.6, 
the monitoring accuracy for the sHDP-VAR(2)-HMM based method can be readily 
proposed. The same scheme can be inferred for the other skills. 

We assessed the performance of the model by comparing with other similar tech- 
niques. Namely, the sticky HDP-HMM with Gaussian observation model and the 
sHDP-VAR(1)-HMM with a first order autoregressive model. We also show the per- 
formance by using different multimodal signal combinations. A confusion matrix 
is used for accuracy evaluation of skill identification for the different methods and 
shown in Fig. 3.8. Notice from Fig. 3.6, that at the beginning of each skill, the identifi- 
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Fig. 3.6 Experiment 1: sHDP-VAR(2)-HMM model outputs for the pick-and-place task 


cation suffers from low accuracies. This is due to the way in which we cumulatively 
compute the log-likelihood (not unlike humans that often need more information 
before confidently classifying). We compute the average test time to perform correct 
classification and report it as a percentage of total duration for a given skill (see 
Table 3.1). Low percentages imply quick identification, while large ones imply slow 
decisions. 

For anomaly monitoring, we manually induced three types of external perturba- 
tions to collect anomaly data. First we introduce human collisions. Human colli- 
sions cause a pose displacement of a held object and leads to failures in the place- 
ment/assembly of that object. Five failure trials were induced for each perturbation, 
for a total of 15 anomalous trials (and 24 nominal trials). Receiver Operating Charac- 
teristic (ROC) curves were used to measure the discriminative ability of the system 
across wrench and pose sensor-modalities (i.e. wrench and wrench-pose modalities). 
Constant k determines our classification threshold and is optimized to obtain the best 
monitoring performance. Comparisons between unimodal wrench signals and multi- 
modal pose-wrench signals are conducted. Figure 3.7 shows ROC curves evaluate the 
relationship between the false positive rate (FPR) and true positive rate (TPR). For 
any given true-positive rate, multimodal tests resulted in lower false-positive rates 
when compared to unimodal versions. Our method also enjoyed high true-positive 
rate when compared to other introspective methods. 
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Fig. 3.7 Experiment 1: Receiver operating characteristic (ROC) curves for the pick-and-place task. 
The figure shows ROC curves that compare the performance of multimodal and unimodal sensing for 


anomaly monitoring and the performance of our anomaly monitoring method versus other baseline 
methods 
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Fig. 3.8 The identification accuracy of different methods with different multimodal signals in a 
pick-and-place experiment 


3.5.2 Experiment 2: HIRO-NX Robot Performs Snap 
Assembly Task 


In this experiment, so as to show the applicability and superiority of the suggested 
sHDP-VAR-HMM monitoring scheme in a real-world assembly task of industrial 
relevance. A 6 DoF dual-arms HIRO robot with electric actuators and a Robotiq 
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Table 3.1 Average time for correct anomaly identification in a pick-and-place task. The notation 
“pPick” and “pPlace” are the abbreviation from “Pre-pick” and “Pre-place”, respectively 


Method pPick (%) | Pik (%) pPlace (%) | Place (%) 
SHDP-HMM (Wrench) 8.65 1.34 3.38 77.72 
SHDP-HMM (Wrench + Pose) 8.70 1.34 3.41 71.21 
SHDP-VAR(1)-HMM (Wrench) 3.62 1.33 3.45 61.10 
SHDP-VAR(1)-HMM (Wrench + 3.60 1.40 3.41 36.28 
Pose) 

SHDP-VAR(2)-HMM (Wrench) 3.51 1.33 3.39 27.61 
SHDP-VAR(2)-HMM (Wrench + 3.50 1.33 3.39 18.18 
Pose) 
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Fig. 3.9 Experiment 2: the skills in a snap assembly task FSM: i Approach; ii Rotation; iii Insertion; 
iv Mating 


6DoF force-torque sensor attached on the wrist is used to perform a snap assembly 
task of camera parts as shown in Fig.3.9. A custom end-effector holds a male part, 
while a female part is fixed to a table in front of the robot. The tool center point 
(TCP) is placed at the location where the male and female parts make contact. The 
world reference frame was located at the manipulator’s base. The TCP position and 
orientation were determined with reference to the world coordinate frame To. The 
force and moment reference frames were determined with respect to the wrist’s frame. 
OpenHRP [10] executes the FSM and modular hybrid pose-force-torque controllers 
execute the skills. Four nominal skills are connected by the FSM: (i) a guarded 
approach, (ii) an alignment procedure, (iii) a snap insertion with high elastic forces, 
and (iv) a mating procedure. Unexpected events occur during initial parts’ contact, 
(e.g. the wrong parts localization) or during the insertion stage where wedging is 
possible. An example of the recorded wrench signals is shown in Fig. 3.10. 
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Fig. 3.10 Experiment 2: recorded wrench signals in the snap assembly task. Different colors encode 
different skill 
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Fig. 3.11 Experiment 2: sHDP-VAR(2)-HMM model outputs in the snap assembly task 


As the procedure in experiment 1, forty-four real-robot nominal trials and 16 


anomalous trials were conducted. We measured the skill identification and anomaly 
monitoring for the modeling schemes considered in experiment 2. The output of the 
proposed HDP-VAR(2)-HMM scheme is shown in Fig. 3.11. The identification accu- 
racy per robot skill is presented in Fig. 3.13 and the average time for computing correct 
classifications is reported as a percentage of total skill duration (see Table 3.2). And 
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Fig. 3.12 Experiment 2: Receiver operating characteristic (ROC) curves for the snap assembly task 
across multimodal scenarios and baseline methods. Unimodal wrench signals are compared with 
multimodal pose and wrench signals 
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Fig. 3.13 The identification accuracy of different method with variable sensor signals in snap 
assembly experiment 


as in experiment 1, we still compare performance with the same baseline methods. 
Meanwhile, the ROC curves confirm the anomaly monitoring performance which is 
illustrated in Fig. 3.12. It is thus seen, that the proposed process monitoring scheme 
has both accurate and efficient performance in contact tasks. 
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Table 3.2 The average time of computing correct identification in snap assembly task 


Method App (%) Rot (%) Ins (%) Mat (%) 
sHDP-HMM (Wrench) 0.71 0.68 5.35 3.41 
sHDP-HMM (Wrench + Pose) 0.71 0.60 5.35 4.20 
SHDP-VAR(1)-HMM (Wrench) 0.71 0.71 5.35 0.89 
SHDP-VAR(1)-HMM (Wrench + 0.71 0.60 5.33 0.81 
Pose) 

SHDP-VAR(2)-HMM (Wrench) 0.71 0.60 5.34 0.73 
SHDP-VAR(2)-HMM (Wrench + 0.71 0.60 5.32 0.69 
Pose) 


3.6 Summary 


This chapter mainly focuses on the representation of robot complex, multi-step 
manipulation tasks and recognition of movement during online execution. Address 
the issue of task representation, we should give enough consideration to human 
understanding and intention in human-robot interactive manipulation tasks such that 
artificially segment complex tasks into multiple sub-tasks, and adopt dynamics that 
require only one demonstration movement of humans. The dynamical movement 
primitive (DMP) modeling method learns a parameterized primitive for each sub- 
task. When given a task, a combination of finite state machine and movement prim- 
itives is proposed to describe a complex manipulation task. The described sub-tasks 
can use different starting points and ending points than the demonstration as the 
motion basis. The input of the meta-model is used to generate new manipulation 
tasks, thereby improving the diversity and adaptability of manipulation tasks. 

For the problem of robot movement recognition, a method of comparing the 
log-likelihood function gradient values of the observed values at the current moment 
under all the learned nonparametric Bayesian models is proposed. First, it is assumed 
that the manipulation task of the robot is decomposed into a set of movement prim- 
itives. Then, the multidimensional signals of each robot movement primitive under 
normal conditions is collected by repeatedly performing the manipulation task, and 
the probability model is established for each primitive by using the method of Chap. 2. 
Finally, the log-likelihood function gradient values of the observations under each 
model are evaluated in real time to realize the recognition of the movement behavior 
during the execution of the robot. Two evaluation indexes of recognition accuracy 
and recognition efficiency (response efficiency) of movement are applied to the pro- 
posed method on the two tasks of “HIRO-NX robot performing electronic component 
assembly tasks” and “Baxter robot performing autonomous pick and place tasks”. 
The comparisons of parametric and non-parametric models and various modalities 
combinations were performed, resulting in proofing the feasibility and effectiveness 
of the proposed movement recognition method in this chapter. 
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Chapter 4 ®) 
Nonparametric Bayesian Method creek 
for Robot Anomaly Monitoring 


Abstract In this chapter, we introduce an anomaly monitoring pipeline using the 
Bayesian nonparametric hidden Markov models after the task representation and 
skill identification in previous chapter, which divided into three categories according 
to different thresholds definition, including (i) log-likelihood-based threshold, (ii) 
threshold based on the gradient of log-likelihood, and (iii) computing the threshold 
by mapping latent state to log-likelihood. Those method are effectively implement 
the anomaly monitoring during robot manipulation task. We also evaluate and analyse 
the performance and results for each method, respectively. 


4.1 Introduction 


Anomalies, also known as outliers, are observations that deviate too much from 
other observations to cause suspicion that they are caused by different mechanisms. 
Anomaly monitoring is a widely studied problem in machine learning, and has impor- 
tant significance in the fields of network intrusion monitoring, credit card fraud mon- 
itoring, and medical health monitoring. In robot manipulation, abnormality refers to 
abnormal behaviors, events, or situations. Robotic abnormality indicates that differ- 
ent from normal execution or observed data that is different from normal, especially 
in the dynamic environment of human-computer interaction, abnormalities Usually 
stems from changes in the environment and human beings. Assuming that the robot 
can monitor, classify, and repair these anomalies, it will reduce or prevent the robot 
from being potentially injured and improve the safety of human-robot collaboration. 
Due to the diversity and complexity of anomaly types, it is not possible to directly 
enumerate and model all anomalies, and to implement anomaly monitoring through 
supervised learning. In general, research on robot abnormality monitoring is often 
implemented using unsupervised learning methods. In other words, it is abnormal to 
identify abnormal phenomena from the normal behavior model of the robot. 
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4.2 Related Work 


The premise of performing anomaly monitoring is to endow robots with longer- 
term autonomy. To this end, we would like to discriminate between nominal and 
anomalous behavior while the robot executes a task. Researchers usually use two 
different methods to solve the problem of robot multimodal anomaly monitoring [1, 
2]. On the one hand, a probability statistical model is established for each modal 
signal such as Rodriguez et al. [3] classify the success and failure of the robot in the 
assembly task by extracting the magnitude of the single-modal output force signal 
[4]. Ando et al. [5] used the K-Nearest Neighbor (KNN) algorithm to independently 
measure the differences between modalities to monitor the abnormal behavior of 
a mobile robot. Pastor et al. [6] used multi-modal sensing signals to monitor the 
failure of the robot to flip the box with chopsticks. This method first uses DMP to 
describe the motion of each stage, and then models each modality independently 
and combines normal statistical features (e.g. mean, variance, etc.) of the data for 
establishing ASM of each movement primitive to realize abnormality monitoring. 
The occurrence of abnormality is defined as if the signal value at least five times 
within three consecutive seconds is lower than a given threshold. Chu et al. [7] 
proposed an anomaly monitoring approach by training multiple HMM models for 
each modality given the observed values, and then the log-likelihood of each HMM 
was calculated as the feature vector for support vector machine model SVM (Support 
Vector Machine, SVM). 

On the other hand, a probabilistic statistical model is established for all modality 
signals. Clifton et al. [8] used Gaussian Mixture Model (GMM) to model multi-modal 
signals, and realized anomaly monitoring through extreme value theory. Qiao et al. 
[9] proposed the use of a sequence of recessive states in the HMM model to monitor 
abnormal instructions. Zhang et al. [10] proposed a hierarchical HMM (Hierarchical 
HMM, HHMM) for system intrusion anomaly monitoring under the premise of the 
HMM method. This method uses HHMM to model normal data and is reflected by 
anomalies in hidden state sequences. Out the system intrusion event. Park et al. [1 1— 
13] carried out an online anomaly monitoring method based on multi-modal sensory 
signals fusion during robot manipulation, where a parametric HMM (Left-to-right 
HMM) models a multi-modal time series of a robot performing normal motions, 
and proposes abnormal thresholds that change by the execution progress to achieve 
abnormal monitoring. We extend the HMM for anomaly monitoring in the subsequent 
part of this chapter that inspired by the HMM cases. In addition, the proposed method 
uses a variety of complementary modal information to detect a variety of potential 
anomalies in the human-robot interaction process on the system where the robot 
feeds the disabled, such as human abnormal contact and environmental collision. 

Hu et al. [14] proposed an abnormal event monitoring method based on HDP- 
HMM. This method combines the Fisher Kernel into the OCSVM (One-Class Sup- 
port Vector Machine) model, which has the advantages of generating models and 
discriminative models. Dilello et al. [15, 16] first proposed the application of the non- 
parameterized Bayesian time series model (SsHDP-HMM) for abnormal monitoring 
and fault diagnoses during robot operation, using the model to transmit force/torque 
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of the robot to perform normal alignment tasks [17, 18]. Sensing data is used for 
modeling, and the threshold value for abnormality monitoring is calculated from the 
validation data set. Then, the log-likelihood function value of the online observation 
data is calculated based on the learned model. Finally, the robot online is compared 
with the abnormality threshold value for anomaly monitoring. 

In this chapter, we define an anomaly whenever a data stream is structurally 
different from prior learned models [19]. As Pimentel et al. review literature shows, 
many anomaly monitoring approaches have been proposed in the last decade [1]. 
However, few works focused on multivariate time series with spatial and temporal 
dependencies as in our case. In Milacski et al. [20], the method requires the full data 
sequence for processing, thus precluding the online detection capability. Besides 
our previous works [21, 22], others have used non-parametric Bayesian approaches 
to learn HMMs in contact tasks for abnormal detection and recognition. In [16, 
23], DiLello et al. used the sticky hierarchical Dirichlet process prior to learn the 
model parameters of an HMM based on wrench signatures for an industrial robot 
alignment task. The work learned specific failure modes online. The approach is 
robust as it is models not just individual contact events but rather the evolution of 
signals, it does so while incorporating domain knowledge, and adjusting its model 
complexity according to the data patterns presented. Nonetheless, the approach is 
limited by the assumptions of the HMM in which observations are conditionally 
independent given the state. This is often an insufficient condition to capture the 
temporal dependencies of complex dynamical behaviors. In [24], Niekum et al. used 
a beta-process prior on the vector auto-regressive (BP-AR-HMM) to extract the state 
complexity from the data. The BP-AR-HMM has the ability to discover and model 
primitives from multiply related time-series. Robot skill discrimination was reliable 
and consistent; however, this work only modeled robot joint angles in a primarily non- 
contact oriented task. In [25], Hu et al. implemented anomaly monitoring based on 
HDP-HMM models and analyzed the efficiency and effectiveness of beam sampling 
in the HDP-HMM model. This implementation however suffers of low computational 
efficiency. Instead, we explore efficient online inference algorithms instead of the 
conventional sampling algorithms. 


4.3 Problem Statement 


Anomaly monitoring is used to continuously monitor the motion status of the robot 
to identify unexpected events during execution. Based on the recognition of robot 
motion in Chap.3, this chapter describes in detail the abnormal event monitoring 
method for robots based on non-parametric Bayesian models. The proposed method 
is also applicable to the subsequent anomaly diagnoses. The main implementation 
ideas are: first, modeling the normally executed multi-modal data; then, test the nor- 
mal movement behavior data based on this model. In order to obtain the definition of 
the boundary between normal and abnormal, that is, the calculation of the abnormal 
threshold; finally, monitoring the testing samples based on the model and the corre- 
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Fig. 4.1 Illustration of the anomaly monitoring procedure after skill identification 


sponding threshold. In order to better study the theory and methods of multivariate 
time series anomaly monitoring, this chapter proposes the non-parametric Bayesian 
models HDP-HMM and sHDP-HMM in different observation models based on the 
parametric HMM model. An anomaly monitor implemented under the combination 
of modal information and different anomaly threshold methods. The anomaly deci- 
sion of the proposed anomaly monitor on robot motion behavior includes: normal, 
single anomaly, and multiple anomalies. 

An ideal online anomaly monitoring system often requires the use of multi- 
modal fusion methods to respond to different types of anomalous events, such as 
force/torque sensing information that can directly detect robotic collision anomalies 
and tactile sensing information that can directly sense objects. And the ability to 
respond to abnormal event occurrences in a timely manner (high response rate) and 
low false touch rate. In order to meet such requirements, this chapter addresses the 
problem of multi-modal fusion robot anomaly monitoring, and proposes the perfor- 
mance of anomaly monitors with three different probability models, different sensor 
information fusion, and different anomaly thresholds. The overview framework pro- 
posed after skill identification in Chap. 3 as shown in Fig. 4.1. 


4.4 Log-Likelihood for Anomaly Monitoring 
4.4.1 Problem Formulation 


The robot introspection model uses non-parametric Bayesian priors along with a Hid- 
den Markov model (HMM) and either Gaussian or Autoregressive emission models. 
First, HMMs are a doubly stochastic and generative process used to make inference 
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on temporal data. The underlying stochastic process contain a finite and fixed number 
of latent states or modes z; which generate observations Y = {y1, ..., yr} through 
mode-specific emission distributions b(y,). These modes are not directly observable 
and represents sub-skills or actions in a given node of a task. Transition distributions, 
encoded in transition matrix A ;;, control the probability of transitioning across modes 
over time. The model assumes conditionally independent observations given the gen- 
erating latent state. Given a set of observations, the Baum—Welch algorithm is used to 
infer model parameters JT = (A, b). The fixed-modes assumption in HMMs limits 
the model’s expressive power as it is unable to derive natural groupings. Bayesian 
nonparametric priors extend HMM models to learn latent complexity from data [26, 
27]. We use Fox’s et al. sticky-Hierarchical Dirichlet Process (SHDP) prior with an 
auto-regressive switching system [26] to model nominal skills as in our previous 
work [21] and derive more robust failure identification methods in manipulation 
tasks, specially in recovery scenarios. 

Bayesian statistics are combined with the sHDP prior to both learn model com- 
plexity k from the data but also to model the transition distribution of an HMM. The 
sticky parameter in the prior discourages fast-mode-switching otherwise present. 
Consider a set of training exemplars ; = {x1, . . . , Xr } of observed multi-modal data 
T consisting of Cartesian pose and wrench values. Then, a mode-dependent matrix of 
regression coefficients A® = [Ay ... AM] e R¢*@] with an rth autoregressive 
order and d dimensions is used along with a measurement noise X, with a symmetric 
positive-definite covariance matrix. The generative model for the sHDP-AR-HMM 
is summarized as: 


[0.6] 
Go =} brô Bly ~ GEM(y). 
k-1 
%|Go ~ Go. (4.1) 


[0,6] 
Gj = XO 1456, z;læ, 6 ~ DP (a, B). 
k-1 


The probability measure G ;, which models the transition distribution of the modes zr; 
determines the weights (probabilities) of transitioning between modes ôĉg,. To avoid 
fixing the mode complexity, G ; uses a prior Go that is unbounded and can grow with 
the complexity of the data. While G ; uses the same set of modes as Go, G ; introduces 
variations over those points. Go provides support for a possibly infinite space, but 
due to the Dirichlet’s process properties (the Chinese Restaurant Process), a finite 
set of modes is selected. In fact, we can understand the hierarchical specification as 
Gj = DP (æ, Go). 

Different observation models can be included into the HMM. A Gaussian distri- 
bution with different covariance models (full, diagonal, and spherical) are consid- 
ered. For Gaussian models, mode specific means and standard deviations are used 
O= V(b, o’). Additionally, the sHDP-HMM can be used to learn VAR processes, 
which can model complex phenomena. The transition distribution is defined as in the 
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sHDP-HMM case, however, instead of independent observations, each mode now 
has conditionally linear dynamics, where the observations are a linear combination 
of the past r mode-dependent observations with additive white noise. A prior on the 
dynamic parameters {A“, ©} is necessary. A conjugate matrix-normal inverse 
Wishart (MNIW) was used to this end. The generative process for the resulting 
HDP-AR-HMM is then found in Eq. (4.2). 


‘ 
Observation Dynamics: y; = > AC yii + e;(Z;). 


isl 
e, ~ A (0, X). (4.2) 


Mode Dynamics: z? ~ x?) P 
“t-1 
By using the model over a set of multi-modal exemplar data ,, the sHDP-AR-HMM 
can discover and model shared behaviors in the data across exemplars. Scalable incre- 
mental or “memoized” variational coordinate ascent, with birth and merge moves 
[27] is used to learn the posterior distribution of the sHDP-HMM family of algo- 
rithms along with mean values for the model parameters 0 of a given skill, hence 
Os, = (H, A}s,,- 


4.4.2 Skill Identification 


The robot introspection system simultaneously detects nominal skills and anomaly 
events. Models are trained for individual skills to capture dynamics from multi-modal 
observations through vector Tm. Tm consists of end-effector pose and wrench values. 
Scalable incremental or “memoized” variational coordinate ascent, with birth and 
merge moves [27] is used to learn the posterior distribution of the sHDP-HMM along 
with mean values for the model parameters JI of a given skill s. Hence 7, = {z, A}: 
the transition matrix and regressor coefficients. 

Given S trained models for M robot skills, scoring is used to compute the expected 
cumulative likelihood of a sequence of observations E [log P (Y |/T;)] for each trained 
model s e S. Given a test trial r, the cumulative log-likelihood is computed for 
test trial observations conditioned on all available trained skill model parameters 
log P(yr:r,\[1)8 at a rate of 200Hz (see Fig.4.2 for an illustration). In Fig. 4.2, 
note the 4 probabilistic models. Given an indexed position in the graph, an anomaly 
threshold corresponding to that skill’s expected cumulative log likelihood. The figure 
also illustrates how at the end of a skill, all data is reset and restarted. The process 
is repeated when a new skill m is started. Given the position in the graph sc, we can 
index the correct log-likelihood IU/7, = Sc) and see if its probability density of the 
test trial given the correct model is greater than the rest: 
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— Expected Log-likelihood 
— Threshold 
—Sub-Task: skill 01 

—— Sub-Task: skill 02 

— Sub-Task: skill 03 
—Sub-Task: skill 04 


Fig. 4.2 Illustration of cumulative log-likelihood (L-) curves for four skills s in a task. As such, 4 
probabilistic models J, are derived. Given an indexed position in graph Nj, one can generate four 
L-curves. The curve for the correct model, has a value increasing likelihood, while the other graphs 
have value decreasing likelihoods. One can also see how at the termination of a skill, all data is 
reset and restarted. No anomalies are shown in this plot 


log P On, |Meorrect) > log P (Yrer | ITs)u 
Vs(s ESAS £ Se). 


If so, the identification is deemed correct, and the time required to achieve the correct 
diagnoses recorded. At the end of the cross-validation period, a diagnoses accuracy 
matrix is derived as well as the mean time threshold value (these results were also 
reported in [21]). 


4.4.3 Anomaly Monitoring 


Due to the characteristics of abnormality, diversity, and unpredictability in the robot’s 
motion behavior, unsupervised learning is the only way to achieve abnormality mon- 
itoring. In particular, this section proposes the use of log-likelihood function val- 
ues based on non-parameterized Bayesian probability statistical models to imple- 
ment anomaly monitoring. The main reason is that the probability model learned 
from normal data will behave when testing anomalous values The cumulative log- 
likelihood function value decreases sharply. Additionally, the curve of the cumulative 
log-likelihood function value that calculated from normal movement behavior will 
show an increasing trend, and the curve calculated with abnormal motion behavior 
will drop significantly when anomaly occurred. Therefore, a method based on the 
log-likelihood function of the model is proposed to realize the abnormal monitoring 
of the robot’s motion behavior by setting the abnormal lower limit threshold. 
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Anomaly monitoring assumes that the cumulative log-likelihood L of a set of 
nominal skill exemplars share similar cumulative log-likelihood patterns. If so, the 
expected cumulative log-likelihood derived in training can be used to implement 
an anomaly threshold F1. Initially, we consider a likelihood curve generated from 
training data for a given skill s. Then, for each time step in an indexed skill s,, the 
anomaly threshold is set to 


Fl, = w(L) —c *o(L) (4.3) 


where c is a real-valued constant that is multiplied by the standard deviation to change 
the threshold. Here, we are only interested in the lower (negative) bound. Then, an 
anomaly is flagged if the cumulative likelihood crosses the threshold at any time: 
if log P (Yr:r, | Meorrect) < F 1s.: anomaly, else nominal. However, it is found that 
the threshold obtained by Eq. (4.3) makes the abnormal monitor frequently trigger 
falsely (False Positive) during actual application, which means that the normal exe- 
cution condition monitoring becomes abnormal. In addition, the determination of 
constant variables requires tedious interactive verification. Aiming at this problem, 
this section refers to the experience of robot motion recognition, and determines the 
threshold of abnormality by solving the gradient information of each log-likelihood 
function value vector. 

Upon our initial exploration of recovery schemes we noticed that after resetting the 
cumulative log-likelihood observations in the HMM model, false-positive anomalies 
were triggered at the beginning of the skill. Further examination revealed that the 
standard deviation of cumulative log-likelihood graphs during training began with 
small variances but grew over time as shown in Fig.4.3. Given that variances are 
small at the beginning of the task, small variations in observations can trigger failure 
flags. 

A second threshold definition was designed to overcome this situation and used 
in our work instead of F1. As the difference between L and F1 is minimal at first 
the new anomaly threshold F2 (for an indexed skill) is focused on computing the 
derivative of the difference: 


d|L—- Fli, 
dt 


F2,. = (4.4) 


Figure 4.4, illustrates the derivative signal crossing the empirical anomaly threshold 
as anomalies are triggered by external disturbances. 
4.4.4 Experiments and Results 


Three manipulation experiments are designed to test the robustness of the online 
decision making system for anomaly recovery. We use accidental human collisions 
as disturbances for a pick-and-place and open-and-close-drawer tasks, and tool col- 
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Analysis of the Cumulative’s Log Likelihood Standard Deviation 
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e.g. Baxter robot 
8000 performs pick and place task 
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Fig. 4.3 This plot shows how the standard deviation of the cumulative log-likelihoods computed 
during training grows over time. The standard deviation grows as observations show greater variance 
due to the accumulation of error 
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Fig. 4.4 Example of how the anomaly identification metric crosses the anomaly threshold as 
anomalies are triggered by external perturbations during a skill execution 


60 4 Nonparametric Bayesian Method for Robot Anomaly Monitoring 


Fig. 4.5 Two examples (pick-and-place (UPPER) and open-and-close-drawer) in which a human 
collaborator accidentally collides the robot. The introspection system identifies an anomaly (see 
bottom left plots) and triggers a recovery behavior (see the fluorescent node in the graph on the 
right) 


lisions for a pick-and-place task. For each of the three experiments, we test four 
separate conditions: 


(i) no anomalies 

(ii) anomalies without recovery 
(iii) one anomaly caused at each executed skill and 
(iv) multiple anomalies caused at each skill. 


The pick-and-place task consists of 5 basic nodes (not counting the home node 
and the recovery node): pre-pick, pick, pre-pick (returns to an offset position), pre- 
place, and place. The open-and-close-drawer task consists of 5 nodes: pre-grip, grip, 
pull-to-open, push-to-close, and go-back-to-start. The direction and intensity of the 
human collisions was random but all executed under the sense that these are accidental 
contacts as a human user reaches into the workspace of the robot without noticing 
the robot’s motion. We note that while the tasks are simple, the main focus of the 
work is placed on the robustness of the recovery systems given different external 
disturbance scenarios. A dual-armed humanoid robot -Baxter- was used. All code 
was run in ROS Indigo and Linux Ubuntu 14.04 on a mobile workstation with an 
Intel Xeon processor, 16GB RAM, and 8 cores. 

For motion generation, two techniques were used for the different tasks. For the 
pick-and-place task, we used Baxter’s internal joint trajectory action server which 
uses cubic splines for interpolation. The goal target is identified online through image- 
processing routines. For the open-and-close-a-drawer task, we trained dynamic 
movement primitives using kinesthetic teaching. 

In terms of robot introspection, the sHDP-HMM code with “memoization” varia- 
tional coordinate ascent, with birth and merge moves was implemented using BNPY 
[28] and wrapped with ROS. Training used 10-trial batches. Observations used 
13 dimensional vectors composed of 7 DoF pose values (position and quaternion) 
and 6 DoF wrench values. A baseline HMM algorithm was implemented through 
hmmlearn [28] and wrapped with ROS. The anomaly threshold for each skill was 
computed through leave-one-out cross-validation. 
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Figure 4.5 shows a representative image of the Baxter robot attempting a pick oper- 
ation. A human collaborator accidentally collides with robot before a pick action. 
Note that collisions were strong enough to move the current pose significantly from 
the intended path and sometimes collide with other parts of the environment. The 
robot introspection system identifies an anomaly and triggers a recovery behavior. 
The lower left part of the image shows the anomaly F2-metric flagging the anomaly. 
The system then begins recovery as seen in the directed graph on the right (imple- 
mented in ROS-SMACH). Video and auxiliary data for the three experiments under 
the four conditions are available in [21]. 

For results reporting, two markers are provided: (i) A success recovery rate for 
the recovery policy and (ii) an F-score, precision, and recall numbers for assessing 
anomaly identification. In particular, these markers are used under experimental 
conditions (iii) and (iv) conditions described at the beginning of the experiment and 
results section. The first two conditions are use as a baseline and compare with the 
generic recovery behavior success rates. 


(1) Human Collisions 

One anomaly per node: 27 test trials were used for the pick-and-place task and 24 
test trials were used for the open-and-close drawer task. Given that both of the tasks 
have 5 nodes, a total of 5 anomalies were induced in the task, 1 per node. Table 4.1, 
shows the recovery success rate of the task (represented by the F-score) and the 
robustness of the identification system through the precision-and-recall quantities 
for both micro and macro settings. The pick-and-place recovery had an average 


Table 4.1 Results for accidental human and tool collisions. Recovery strategy success rates are 
presented by the F-score, and robustness of the robot introspection system for anomaly is shown in 
the recall and precision settings. We also compare the performance of the more expressive sHDP- 
HMM model with an HMM that serves as a baseline. Generally, the sHDP-HMM model allowed 
for better introspection and thus better recovery rates and better recall 


HMM sHDP-HMM 
Micro Macro Micro Macro 
F Recall} Precision | Recall] Precision |F Recall] Precision | Recall) Precision 


Human collision 


Pick & Place 


(1 An./skill) 


0.88 


0.96 


0.88 


0.97 


0.81 


1.00 


0.82 


1.00 


(Mult. An./skill) 


0.83 


0.97 


0.84 


0.97 


0.84 


0.95 


0.85 


0.95 


Open drawer 


(1 An./skill) 0.80 | 0.80 | 1.00 0.80 | 1.00 0.92 | 0.90 | 0.82 0.90 | 0.85 

(Mult. An./skill) | 0.72 | 0.72 | 1.00 0.72 | 1.00 0.93 | 0.93 | 1.00 0.93 1.00 
Tool collision 

Pick & Place 0.71 | 0.71 1.00 0.71 1.00 0.89 | 1.00 | 0.89 1.00 | 0.89 
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success rate of 85% with a maximum of 88%. The precision was 100% (for both 
macro/micro) indicating strong resistance to false positives. The recall was ~82% (for 
macro/micro) indicating the existence of some false-negatives. This might indicate 
that there might have been some collisions that were of lower magnitudes than the 
ones we might have trained with and were not detected by the system. For the drawer 
task, the recovery success rate was 91.67%. The precision was 95% (macro/micro) 
and recall was ~84.5%. As with the human collision, weaker contact signals in tests 
compared to training might be the reason for the presence of false-negatives in our 
system. 

Multiple anomalies per node: Under this scenario the pick-and-place task ran 
42 test trials and the drawer task ran 30 test trials. Five anomalies were induced 
repeatedly one-after-the other for each node. The pick-and-place recovered 85% of 
the time with a precision of ~95% and a recall of ~84.5% (micro/macro). The drawer 
task recovered 93.3% of the time with a precision of ~ 100%, and a recall of ~93% 
(micro/macro). 


(2) Tool Collisions 

For the tool collision experiment, we only tested recovery in the pick-and-place task. 
The results for this scenario were similar to that of human collision. The recovery 
success rate was ~88.89%, the precision 100%, and the recall 88.89%. 


4.4.5 Discussion 


This work showed the ability of a system to recover from unmodeled and accidental 
external disturbances that can’t be anticipated. Such disturbances will be more com- 
mon in shared human-robot workspaces. Our work demonstrated that our recovery 
strategy in connection with our previous introspection framework recovered 88% 
of the time from accidental human and tool collisions under single-anomaly and 
multiple-anomaly scenarios per node. The results indicate the system can recover at 
any part of the task, even when it is abused and multiple anomalies occur consecu- 
tively. From the videos in [21], we can see that even when in cases where the robot is 
in constant duress, the robot recovers consistently. Such robustness will play a role 
in enabling robots have increasing levels of long-term autonomy. 

We by using a more expressive model to do robot introspection, our recovery 
ability also increased. We will continue to explore improved models that can better 
capture spatio-temporal relationships of high-dimensional multi-modal data. As well 
as looking for representations that scale over time in order to acquire a useful and 
practical library of skill identification and motion generation. 

Yet there is much work to be done. Manual annotations for state-dependency 
are an important weakness. Crucially we wish to move towards modeling human 
understanding for decision making in the midst of robot-object-environment inter- 
actions. Scalability is an important factor in this domain as the system must scale 
to ever growing number of learned tasks. Learning how anomalies and recovery 
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decisions are made and re-used across similar scenarios will be important. The man- 
ual approach will not scale. Adaptability and not only reverting is also important. 
Incremental learning is also key. The current work is limited to reverting. By simply 
revering we don’t model recovery behaviors and also do not learn how to adapt. We 
also cannot handle new scenarios. We must develop action-confidence metrics that 
let us learn new scenarios on demand. 


4.5 Gradient of Log-Likelihood for Anomaly Monitoring 


Based on the implementation in Sect. 4.4, it is difficult to determine the constant 
parameters c in actual applications, which leads to frequent false positives. To address 
the problem, this section explores an anomaly monitoring method based on the log- 
likelihood function gradient value and mathematically verifies its implementation 
process to prove its feasibility and effectiveness. After identifying the motion behav- 
ior s, this method assumes the parameters of the specific motion behavior and its 
established probability model are known as ©. From Sect.2.4.1, we can know that 
the log-likelihood function value for the observation sequence can be determined by 
the probability in all credible states, and the simplified description is 


N N 
L(t) = log Y ailt) = log X` exp (log a;(t)) (4.5) 


i=1 i=l 


Therefore, given an HMM model © and an incoming multimodal time series Y1., 
the natural logarithm of the filtered belief state associated with the forward model for 
latent state į € {1, 2, ..., N} can be represented according to Eq. (4.5). To compute 
Z,, we first compute log a; (t). According to the forward-algorithm, we derive: 


ai (1) = mibi (yı), 
N 
a(t + 1) = bi O1) X aj QA; (4.6) 
j=l 
From Eqs. (4.5) and (4.6), we know that loga;(t) can be computed recursively 
through log a,.(t — 1). Expanding the log in Eq. (4.5) we have: 


N 
loga;(t) = log b; (y) + logò exp(logæ;(t — 1) + log Aji) (4.7) 


j=l 
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4.5.1 The Hidden Markov Model Forward Gradient 


e Viterbi Path in HMMs 


The Viterbi algorithm, expanded in Eq. (4.8), attempts to estimate the most likely 
state sequence. Viterbi uses dynamic programming to estimate the underlying state 
sequence Z)., through MAP computations given a sequence of observations yj.;: 


A 


Èi: = arg max P (Zi: | Yir) 
Zi:t 
= arg max( b; (y:) 


arg max (Aziz P (Z111, Yit-1)) ) 


211-1 


= arg max(b-,(y;) (4.8) 


arg max (Az, iz, bz 1-1) 


Zt—1 


arg max (Aziz Tz bz, O1))--)) 


First, we introduce simplified notation for the Viterbi algorithm, where a node j 
has a maximal belief state ô, (j) = maxz, P(Z1:+ = j | Yı) with associated trace- 
back Z;. 


Theorem 4.1 For an incremental time series Y, a good HMM model outputs an 
incremental Viterbi path that stably expands on the previous one. The stable expan- 
sion of the Viterbi path is as follows: given a Viterbi path “11223” for an input Y[1:t], 
then the path at Y(1:t + 1] becomes “11223*”, where * is the newly appended hidden 
state. 


Good models are those that predict their data as accurately as possible and can 
be achieve through two steps: (1) HMM parameters optimization: the Baum—Welch 
algorithm given a proper initialization or similar algorithm incrementally optimize 
HMM parameters until a local maximum of likelihood is reached; minimizing the 
perplexity of the model. (ii) Model selection optimization: this consists in selecting 
the number of hidden states and values for observation models. Many techniques 
exist including BIC, MCMC, Variational Bayes, or non-parametric HMMs [29]. 


Proof Consider, without loss of generality, a Viterbi graph where we examine two 
consecutive time-steps (t — 1, t) along two possible latent states /, k (the analysis 
generalizes to HMMs with more states). We also assume Vi, i = arg max; A;j; that 
is, all hidden states states tend to self-transition. Also at time t — 1: 


ij> 


5:1 (1) > 5;-1(k). (4.9) 
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We also define the following symbol: 
wji(t) = Aji * Dj (Yi). (4.10) 


Due to our first assumption, we have: max; w;;(t) = wi: (t). Then, at time f, the ô 
values are: 


ô, (1) = max (ô;,—1 (L) * wu E), 5-1 (k) * Wei (t)) (4.11) 
ô (k) = max (5,12) * wik (t), 8:1 (k) * Wee (t)) (4.12) 


According to (4.9) and our max weight formulation, the max function in (4.11) is: 
6; (1) = 6;-1(1) * wu (t). So, the max state 7 at time t — 1 will contribute to itself 
instead of k at time t. Therefore, there is only one condition under which the Viterbi 
sequence breaks: 


ô (k) = 6,1 (k) * wet) and 4,(k) > 6,(/) 


In other words, given our original assumption, the Viterbi sequence breaks if the 
following inequalities are met: 


ôr—1 (k) x Wee (t) > 5-1) * w(t) 
and 
ôr—ı (k) x welt) > ôr (D) * w(t) 


In ratio form: 


Wx (t) 2 6-10) 
w(t) —-5;-1(k) 
and 
Wkk (t) = 6-10) 
wit) —-6;-1(k) 


(4.13) 


(4.14) 


If an observation is emitted by state / and it is not undergoing a state switch, 
(ôr (D) > 6,-1(k) and w(t) > wex(t)), inequality (4.14) fails and the Viterbi sequence 
does not break. 

When we transition from state / to k and begin emitting wg (t) > w(t). How- 
ever, the momentum in 6,(k) and 6,(/) prevent their inequality relationship to 
switch. Nonetheless, after p time steps, the inequality 5,_;(/) > 6,-;(k) becomes 
5:-14p() < 5;-14p(k). The latter is reasonable when state k has been emitting for 
some time. It is before this time t — 1 + p that inequalities (4.13) and (4.14) are met. 
To see why, one will notice the left hand side inequalities in (4.13) and (4.14) are 
larger than 1 while the right hand side ratios become smaller than 1 at time t — 1 + p, 
thus their inequality relationships must’ve swapped before this time. Note that the 
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order in which inequalities (4.13) and (4.14) are met matters. If (4.14) is met but 
(4.13) is not, a clean cut occurs since we have: 


5,1 (k) * Wee (t) < 6,12) * Wiz (t) (4.15) 
and 
5,1 (k) x wkk (t) > 6-12) * wi) (4.16) 


Equation (4.16) asserts a switch from / to k. Equation (4.15) states that 46,_;(/) 
contributes to ô (k), implying the previous max state contributes to the next max 
state. And since the max state has changed, the roles of / and k swap and inequality 
(4.14) begins to fail and renders sequence break unattainable. This yields a clean 
transition cut. If (4.13) and (4.14) are met simultaneously, a sequence break occurs. 
But since (4.14) is met, the states switch and the roles of / and k swap and preclude 
a further sequence break. 

If, let’s say at time f,, inequality (4.13) is met but (4.14) is not, a future sequence 
break is destined to happen at the future moment when (4.14) is met, let’s say at 
time tp. When that sequence break occurs, the history between tą and t, flips and 
after the sequence breaks the state transitions from / to k. Again, roles switch and no 
further sequence breaks occur. Above all, we can safely conclude that the during the 
execution of the stable period of a hidden state, no sequence break will occur. It is 
only during state transitions that a sequence could break and even if it does, it only 
last for a single time step—the time step when inequality (4.14) is met. The analysis 
extends to HMM models with more than 2 hidden states, so long as we apply the 
analysis to pairs of hidden states composed of the current max state and another 
non-max state. 


Corollary 4.1 Given Theorem 4.1 the gradient of the log-likelihood of the forward 
algorithm (from now on referred to as the forward gradient) will depend solely on 
the latest emission probabilities and the transition matrix. 


Proof According to the log-exp-sum trick [29], the approximation 
ing this approximation to Eqs. (4.5) and (4.7), which is supported by Theorem 4.1, 
we have: 

L,% max (loga;(t)) (4.17) 


iefl,...,N} 
log æ; (t) ~ log bi (y,)+ 


loga;(t — 1) + log Aj; 
a oes ) + log Aji) 


(4.18) 


Substitute Eq. (4.18) into Eq. (4.17), rename i and j to z; and z,;_;, then recursively 
decompose log a, we have: 
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Z= max (loga,,(t)) 
N} 


zE{l,..., 


= log b; 
max (log b,(y1)+ 


zre{l,..., 


max ane Az, jz, + loga,,_,(t — 1))) 


Za E 


= max „(log bz (y) + (4.19) 


Efl, 


max Kayne Aziza + log bz, (4-1) + 


Zr-1€{1,..., 


gone 


Equation (4.19) is the log version of Eq. (4.8). This suggests that, after approximations 
by Eqs. (4.17) and (4.18), the computation of the log-likelihood is the same as the 
computation of the Viterbi path using the Viterbi algorithm. 

Now, since Theorem 4.1 shows that in general the Viterbi path at time t, 21:;, 
expands on the Viterbi path at time t — 1, Z1.;_1, we have: 


Z= max „(log bz, (y) + 


zE{l,..., 


max mE Aziz tloga,,_,(t — 1))) 


Z-1€{1,..., 


= log bz, (yr) + log Az,_,2, ag L,-1. (4.20) 
Then, the forward gradient can be derived from Eq. (4.20) as: 


VZ, = LZ, — Li = log bz, (Y) + log Az,_,3,. (4.21) 
Equation (4.21) supports Corollary 4.1, where the forward gradient depends on the 
latest emission probability b;, (y;) and transition probability from hidden state Z,_; 
to Z;. Also, given that good HMM models have strong inertia (high probabilities of 
self-transitions), state-switching should be sparse and then Z,_ will equal to 2; most 
of the time. 


e Normal Skill Identification Based on the Forward Gradient 


Corollary 4.1 led us to design a new method for skill identification. If we use n HMM 
models to represent n robot skills, with observations coming from a certain skill, the 
HMM model corresponding to that skill 7, should output a value-increasing forward 
log-likelihood curve that is greater than the rest of the HMM models. This also means 
model m will output a larger forward gradient value compared to other models. 
The forward gradient depends on the latest emission probabilities, which in turn 
depend on the latest observation. The largest probabilities and thus gradients will 
belong to the HMM model of a currently executing robot skill. The key insight 
however is the inverse relationship: the use of the forward gradients to infer the 
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currently executing skill, and then it can validate the strong correlation between the 
forward gradient and skill observations. This is contrasted with the log-likelihoods 
of the observations log P(Y;.,|J7) do not. The forward-gradient measure for skill 
identification is defined as follows: given p skills sı : sp, we have HMM models 
ms for s € {1,..., p} and an input time series Y, then the most probable skill $ 
generating Y[f] is inferred as: 


§ = arg anax, (V2 (Y)) (4.22) 


PPRS 


where, VL?" (Y) is the forward gradient output by model m, at time ¢ computed 
using time series Y. 


4.5.2 Anomaly Monitoring Based on the Forward Gradient 


We now build on the premise established in Eq. (4.22). Furthermore, consider a set 
of nominal observations for an executing skill, we know that the corresponding skill 
HMM model will output a value-increasing forward log-likelihood curve, and hence, 
a stable positive forward gradient. So, when an anomaly occurs, the forward gradi- 
ent decreases significantly as illustrated in Fig. 4.6. Given that anomalies influence 
the forward gradient value, we propose a gradient-based metric for HMM anomaly 


monitoring. 
Consider an HMM model m representing a skill s with n time-series trials Y; 
for i € {1,...,n} collected from successful executions of skills s € S. To detect 


anomalies in a new time series Y we can first derive: 


V Lax = max ( 
ie€{l,...,.n} tef{l,..., 


VLnin = min ( min (Vee a): 


ie€{l,....n} tef{l,..., 


V Zrange = VZ max = V Znin» 


15 
Time (s) 


Fig. 4.6 For each skill (denoted in different background color), the log-likelihood gradient can be 
used for skill identification 
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where T; is the time length of trial Y; and V L” (Y;) is the forward gradient output by 
model m at time t computed using time series Y;. Then, we use an empirically-derived 
test to trigger an anomaly for Y: 


Viet ange 


VZ" Y) < VZnin m 2 


(4.23) 
This test detects if the gradient is an outlier compared with gradients of successful 
skill executions. 


4.5.3 Experiments and Results 


As for experimental setup, a dual armed humanoid Baxter robot was used to perform 
a pick-and-place operation. The robot consisted of a Robotiq force-torque sensor and 
standard Baxter fingers. 5 nominal trials were used for training the HMM model. In 
testing, 5 trials were used for skill identification and anomaly monitoring respectively. 
The pick-and-place task consists of 5 skills: 


(i) hover over the picking position, 
(ii) grasp the object, 
Gii) lift the object, 
(iv) hover to the placing position, and 
(v) place the object. 


Figure 4.7, shows the experimental setup and the execution of the five skills. 

For training, the observation vector concatenates a 7-dimensional Cartesian end- 
effector pose and a 6-dimensional wrench. For each skill, we train corresponding 
HMM models using the Baum—Welch algorithm. The number of hidden states is 
selected such that emission probabilities are maximized leading to distinct and 
uniquely grouped hidden states. 

For results reporting, we use the three factors identified by Pettersson’s survey on 
Event Detection [30]. Namely: diagnoses accuracy, robustness (false-positive rate), 
and reaction time (the time it takes to identify a skill from the beginning of a skill 
execution). Note that for anomaly identification, internal and external perturbations 
are used including: unexpected movement of target object, object absence, slippery 
picks, and human collisions. 


e (A) Skill Identification Performance 


Figure 4.8 presents The skill identification confusion matrix. Skills 1 and 3 were 
recognized with 100% accuracy, 2 and 4 with 99% accuracy, and skill 5 with the 
largest surface contacts with 94% accuracy. Overall accuracy was 98.4%. 
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Fig. 4.7 The Baxter humanoid robot performing a pick-and-place task. 5 independent skills used 
to perform the task. Executed skill motions are sketched with red arrows 


Fig. 4.8 Skill identification SKill identification confusion matrix for pick and place task 
confusion matrix for 
pick-and-place 
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e (B) Reaction Time Performance 


In terms of reaction time, a percentage is computed to assess the time it takes for 
the identification to execute from the beginning of a skill. The reaction percentage, 
using t as “true”, is computed as: offset = predicted-t_beginning, length=t_end- 
t_beginning, and reaction=offset/length. The closer the reaction percentage is to 
0% the better the identification method. A negative reaction percentage means the 
predicted start occurs earlier than ground truth, while a positive percentage implies a 
delayed identification. We assess two forms to determine the beginning of a skill (i) 
use the “first skill” occurrence, or (ii) use the “first 10 successive skill” occurrences. 
The reaction percentage for these two formats is found in Table 4.2. The average 
reaction time for absolute values (i.e. looking at the average time difference of the 
prediction, whether early or late) the “first skill” is 2.70% across all skills and the 
average reaction time for the “first 10 skills” is 0.97%. Between the two measures, 
we have a total average of 1.84%. 


e (C) Anomaly Monitoring Performance 


For anomaly monitoring performance of our gradient-based method, we use two 
environments: (i) anomaly identification as it occurs and any false-positives before 
an actual anomaly occurs, and (ii) an external collision is given to the robot to trigger 
a recovery. Then, when the robot completes its recovery behavior, we count how 
many false-positives are triggered before moving to the next skill execution. The 
robot recovery behavior is detailed in [31]. Five nominal and five anomalous trials 
are used for the analysis. The results are compared with two other baseline methods: 
the magnitude-based metric from the description in normal skill identification based 
on the forward gradient, and the derivative-of-difference metric from the definition 
in anomaly monitoring based on the forward gradient. 
The five anomalous trials contain a total of 14 anomalies, consisting of: 


(i) one anomaly caused by the displacement of the target object 
(ii) one anomaly caused by no target object 
(iii) two anomalies caused by slippery picks 
(iv) five anomalies caused by human collisions to the robot gripper during each skill 
execution 
(v) five anomalies caused by human collisions to the robot arm during each skill 
execution. 


Table 4.2 Reaction time as a duration percentage of a skill 


Method Average reaction percentage 

Skill 1 (%) Skill 2 (%) Skill 3 (%) Skill 4 (%) Skill 5 (%) 
First skill 0.00 0.23 -3.23 0.28 —9.77 
occurrence 
First 10 0.00 0.23 -2.36 0.65 1.61 
occurrences 
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Table 4.3 Anomaly monitoring metrics performance comparison 


Detection metric Micro 

F-score (%) Precision (%) Recall (%) 
Proposed forward-gradient measure | 100 100 100 
Derivative-of-difference measure 82.35 70 100 
Magnitude-based measure 60.47 44.83 92.86 


e (D) Pre-recovery Performance 


For pre-recovery performance computation, during each trial, we record the anoma- 
lies triggered by the testing metric and count its true positives, false positives and 
false negatives. The results are shown in Table 4.3. The result shows our proposed 
forward gradient detected all anomalies and triggered no false positives or false neg- 
atives. The other two baseline methods suffer from false positives though they deliver 
high true positives. 


e (E) Post-recovery Performance 


For post-recovery performance metrics, we trigger an intentional anomaly, after 
recovery is completed, we count any false-positives before next skill execution. Both 
the forward gradient method and the derivative-of-difference method had not false- 
positives. Whilst the magnitude-based metric had more than 10 and prevented the 
system from continuing its task execution. 


e Comparison with Related Works 


Comparisons across works is challenging as results use different formats across 
experiments. Table 4.4 is an effort to harmonize results across related literatures. 
The comparison should be done loosely as different tasks (small levels of contact vs. 
large levels of contacts, structured environment vs. unstructured environment) present 
different challenges to event detection. For skill identification, our current approach 
ranks 2nd behind the tool breakage work that identified anomalies in structured 
milling processes. Our work did better than [16, 22], albeit these works modeled 
more complex dynamical phenomena. Similar statements can be made about anomaly 
identification. As for reaction times, our approach offers about double the speed-up 
compared to the only other work that reported this number. In conclusion, based on 
internal and external evidence, we hold that our measure is the most robust, stable, 
and fastest measure reported to date. 


4.5.4 Discussion 


This work presented a theoretically derived event detection measure useful for nomi- 
nal and anomalous behavior identification, even in post-recovery actions. Our results 
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Table 4.4 Skill and anomaly identification, and reaction time comparison for state-of-the-art event- 


detection methods 


Technique ID accuracy (%) 
AFF/DCC/CSM/SVM [32] 84.66 
sHDP-HMM [16] 89.50 

RCBHT w/multiclass SVM [22] 97.00 

HMM w/GradientBased measure (current) 98.40 

Tool breakage SVM [33] 99.38 

Technique anomalyID (%) Accuracy 
HMM, varying threshold [11] ~80.00 

MLP [12] 83.27 
sHDP-VAR-HMM, mag metric [21] ~85.00 
sHDP-HMM [16] 87.50 

RCBHT w/multiclass SVM [22] 97.00 

HMM, gradient metric (current) 100.00 

Technique Reaction time (%) 
sHDP-VAR-HMM, mag metric [21] 3.70° 

HMM, gradient metric (current) 1.84 


* Average of Table 3 is cited from [32], k-fold: gesture classification performance (novice, inter, and 
expert and both macro/micro levels) 
> Average reaction time for their best multimodal setup 


showed very strong performance compared with state-of-the-art results across the 
board. More experimental validation is certainly necessary: both in number of tri- 
als and robotic tasks. This work also remains to be tested in the area of anomaly 
diagnoses. The latter is concerned not only with the identification problem with the 
grouping of anomaly types which is more challenging. We anticipate working in 
conjunction with machine or deep learning models for the diagnoses of this signals. 
Some works [11, 12, 16, 23] already provide some characterizations. 


4.6 Mapping Latent State to Log-Likelihood for Anomaly 
Monitoring 


4.6.1 Hierarchical Dirichlet Process Vector Auto-regressive 
Hidden Markov Models 


HMMs segment sequential data into interpretable hidden states that are assumed 
Markovian. They are limited by its a priori determination of hidden states (and fixed 
throughout the modeling process) as well as the HMMs limited ability to model com- 
plex dynamics due to the assumption that observations are conditionally independent 
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given the state. Instead, this section uses Bayesian non-parametric techniques that 
model HMMs through priors that can model an unbounded number of hidden states. 
In particular, we use HDP priors to learn HMMs with infinite hidden states. We also 
use the switching vector auto-regressive (HDP-VAR-HMM) to model multimodal 
observation sequences and its element consists of features extracted from an F/T 
sensor, joint encoders, and a tactile sensor (for a discussion on the used features see 
experimental setup). Figure 4.9 illustrates graphical model representations for the 
sHDP-HMM and the sHDP-VAR-HMM. 

Assume that we record N sequential executions for each skill and expect to learn 
the common dynamics among them. We utilize the HDP-VAR-HMM to jointly model 
those sequences, where sequence n € N has multidimensional observation data 
Yn = [y1, 2,---, yr] at time t. For instance, y; € R? could be an instant of wrench 
signatures and end effector velocity, where d is the dimensionality of observed fea- 
tures. The HDP-VAR-HMM interprets each observation y, by assigning it a single 
hidden state z,. The hidden state value is derived from: (i) a countably infinite set 
of consecutive integers z; = k € {1,2,3,...}, (ii) an initial probabilistic state dis- 
tributions 7o generated with Markovian dynamics, and (iii) a transition distribution 
{x}. Then, the Markovian structure on the state sequence dictates that for all 
t>1: 

Zt ~ Tzi? 


p(z = k) = 10, (4.24) 


pr =k z1 = k) = Tg. 


Since z; ~ 7z,» we see that z,_; indexes all observations y; are assigned with z;—1. 
We can draw the observation y; given specific hidden state z; = k from its corre- 
sponding observation likelihood functions. In our case, we consider the first-order 
auto-regressive Gaussian likelihood. As such, observations y, are considered to be a 
noisy linear combination of the previous observation plus additive white noise ¢ and 
can be expressed as: 


Yi = Anyi + Erz, = k), elk) ~ VO, X). (4.25) 


where, each state k has two dynamic parameters {A}, Xk}: Ax, ad x d matrix of 
regression coefficients that defines the expected value of each successive observation 
as E[y;|y;-1] = Akyr-1, and Xk, a d x d symmetric positive definite matrix that 
defines the covariance matrix of state k. 

Therefore, the Gaussian regression observation model explains a observed data 
pairs of input y and output z. Each input is a vector of length y € R, while each 
output is a scalar. In the paper, we assume that the input data y are fixed dependence 
on previous observation and focus on a generative model for learning the output data 
z which depends on y. Various generative models, such as full-covariance Gaussian, 
diagonal-covariance Gaussian, are possible for the observed input data. These can 
be directly define the multivariate Gaussian log-likelihood function given specific 
state Zk, 


4.6 Mapping Latent State to Log-Likelihood for Anomaly Monitoring 75 


K Sticky parameter 


(a) SHDP-HMM 


K Sticky parameter 


Fig. 4.9 Graphical representation of HDP-HMM and HDP-VAR-HMM. a The graphical repre- 
sentation of a sticky HDP-HMM over T time steps. The latent, discrete-valued Markov process 
Zr captures the temporal dependencies in the observations y;. b The graphical model for a sticky 
HDP-VAR (r)-HMM over T time steps. The latent, discrete-valued Markov process z; dictates the 
linear dynamical model at time t. For an order r switching VAR process (r = 1 in this paper), the 
conditionally linear dynamics are completely determined by previous observations y;—1 


(b) SHDP-VAR-HMM 
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log p(ytl@x, yr—1) = log P(t lA, Xk, Yr—1) = log N (wr |Aky—-1, Xk) 


r A Dr A 
z0! kYt—1) Xy Ot — AkYyt-1). 
(4.26) 


— logn) — llog| 5 
= Caen 7/08! kl 


However, it is often the case that we don’t know the parameters of this distribution, 
but instead to estimate them. We need to learn the approximate values & = {Ax, Xk} 
for each state by defining a conjugate prior distribution on it. Particularly, the Matrix 
Normal Inverse Wishart (MNIW) distribution is conjugate for both the A; and X are 
uncertain. This distribution assume that when only the covariance X% is uncertain, 
the conjugate prior is defined as d-dimensional Inverse Wishart (IW) distribution 
with covariance parameter A, a symmetric, positive definite scale matrix, and v, the 
degrees of freedom, i.e. 

Er ~ LW v, A), (4.27) 


where the first moment is given by the mean 


S] = M. (4.28) 


After that, we place a conditionally Matrix-Normal (MN) prior on A; given Xk, 
which is given by 
Ak ~ MNM, Xr, V), (4.29) 


where M is the mean matrix of Ag and V is a symmetric, positive definite matrix 
that determine the covariance of A, relative to X. Therefore, we derive the useful 
expectations 


i(Ay) = M, 


T (4.30) 
Cov[A;] = EEES x A). 


The resulting log-likelihood function of joint distribution of Az, A, is defined as 
log p(Ax, Xlv, A, M, V!) 


logp(A;, Ajlv, A, M, V') = Fey w, A) + Fay (M, V) + F, (4.31) 
where, 


d 
d(d-1 d 
Fwy, A) = ( 7 liggt z 082v L'or ( 
i=l 


v+l-d 
2 


+ Żlog]|Al| 
UiA 
7 8 


a d 
Fun (M,V) = ~ 5 logan + 5 los lVI, 


v—1 
2 


F = 


1 1 
log|Ax| — 31r (Ag ARAL V) +tr(ApAgVM?) — zk + MVM?‘)). 
(4.32) 
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where I () is represented the gamma function. Therefore, all the optional hyperpa- 
rameters of HDP-VAR-HMM are specified by v, A, M, V. Apparently this straight- 
forwardly method is computationally feasible when those four parameters are known. 
We will specify the detailed options in experimental section through all our experi- 
ments. 

With reference to the Eq. (4.24), for the probabilistic representation of transition 
distribution x; under the HDP-VAR-HMM prior, the number of states is unbounded 
and every observation assigned a unique state. The HDP prior encourages shar- 
ing states over time through a latent root probability vector 6 over the infinite set 
of states (see Fig. 4.9). Consider the probability a É; = 1 on a countably infi- 
nite set, where defines the sticking-breaking construction of the prior on 6 and 
first draws independent variable u; ~ Beta(1, y) for each state j, and then sets 
Bj = bj Vee; (1 — ue), where the concentration parameter y controls the relative 
proportion of the weights 6;. Here, we interpret jz; as the conditional probability of 
choosing state j amongs states {j, j +1, j +2,...} 

In expectation, we always describe the first K orders in stick-breaking as the 
most common states and represent their probabilities with the vector [f1, B2,..., 
Br, >x], where Bx = ae x41: Given this (K+1) dimensional probability vector 
p, we simplify the HDP-VAR-HMM transition distributions 7x ; for each state j from 
a Dirichlet distribution with mean equal to 6 and variance governed by concentration 
parameter a > 0 


[Bi, B2,-.-, Bx, B>x] ~ Dir (api, af2,...,0B>xK), (4.33) 


Generally, we define the starting probability vector zo from the same prior, but with 
much smaller variance, mo ~ Dir (apf) with o > a, which denotes that few starting 
states are observed. 

To capture the sufficient dynamics of given time series n with T, observations. 
Our goal is to interpret these observation with the hidden state z;. However, instead 
of assigning variable states in a short time period, we should treat each observa- 
tion in the context of its predecessors for improving the temporal consistency. A 
“sticky” parameter « of favors self-transition by assigning extra prior mass on the 
transition probability xj; was introduced. Thus, the transition representation will in 
a temporally consistent way, that is 


[Bi, B2,..., Bx, B>x] ~ Dir(apı, ... aß; +k,...,B>xK). (4.34) 


where «x > 0 governs the degree of self-transition bias. Informally, what we are doing 
is increasing the expected probability of self-transition for keeping the dynamical 
phenomenon consistency for many time-steps. 

After constructing the sHDP-VAR-HMM model, we utilize the state-of-the-art 
variational inference algorithm for learning the parameters, which named as Split- 
Merge Monte Carlo method by Michael C. hughes et al. More information can be 
found in [28]. 
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4.6.2 Anomaly Monitoring Based on Hidden State 


The anomaly detector identifies whether a skill execution contains: no anomalies, 
one anomaly, or multiple anomalies. In this section, we introduce the mathematical 
modelling that makes up our detector. 


e (A) Hidden State Representation 


Hidden state space modeling of multivariate time-series is one of the most important 
tasks in representation learning by dimensional reduction. In this work, we propose 
the anomaly detector is a thresholding marginal log-likelihood with Bayesian non- 
parametric hidden Markov model, which leads to tackle the anomaly monitoring 
problem in a more natural way to meet the need of real-world applications. 

As we discussed above, the sHDP- VAR-HMM interpreted each observation y, by 
assigning to a single hidden state z,, where the hidden state value is derived from a 
countably infinite set of consecutive integers z; = k € {1,2,3,..., K}. Figure 4.10 
shows the low-dimensional hidden state space of the multi-modal time series gener- 
ated by the robot during manipulation, which contains 5 hidden states. 

We denote ©, to represent all the parameters of the trained sHDP-VAR-HMM 
from nominal demonstrations, including the hyperparameters for learning the poste- 
rior and the parameters for the observation model definition. Our anomaly detector 
performs anomaly monitoring by comparing the marginal log-likelihood log p(y;|@s) 
based on current executing skill of the observation y, with a learned threshold p(z), 
that derives from the most likely estimated hidden state given current observations. 


z = arg max p(Z;|);, Os). (4.35) 
1 


Multimodal Signals 


Time(sec) 


Fig. 4.10 An illustration for the representation of multimodal time series in low dimensional hidden 
state space 


4.6 Mapping Latent State to Log-Likelihood for Anomaly Monitoring 79 


Here, the z would be included variable values, that is, we should synchronously 
update the threshold in the same skill period. Let’s say the definition of an anomaly 
data point at any time during the manipulation is one that logp(y,|©,) < p(z), oth- 
erwise it deems the manipulation to be nominal. 

The proposed method generates the mapping pairs of marginal log-likelihood and 
the most likely hidden state at any time. As specific hidden state z, we calculated the 
fixed log-likelihood threshold from testing the nominal demonstrations by subtract- 
ing half of range between maximum and minimum from the minimum. Unlike the 
anomalous log-likelihood was deviate by a certain standard deviation from the mean 
in [11, 12, 21], that is o(z) = u — co, where the symbol c is a constant value for 
adjusting the sensitivity of the detector, u and ø are the estimated mean and standard 
deviation of the log-likelihood. Our implementation don’t need to take the constant 
c into consideration for each skill or hidden state. 


e (B) Joint Probability of the Observed Sequence 


The threshold p(z) is associated with the joint probability of the observed sequence. 
Firstly, in order to derive the forward-backward procedure of first-order auto- 
regressive HMM for the joint state sequence z,.7 given observation yı:r, plus the 
initial observation yo, (T > 1). We have this problem when dealing with conven- 
tional HMM, here the auto-regressive structure of ARHMM makes this situation 
more complicated, a modified forward-backward method could solve the prob- 
lem efficiently. Let’s first write the joint distribution with the Markov property 


POYtlY-1, ++. Y1) = POr|Y-1), that is 


T 
po:r) = bm P(Yl¥1x-1), 
t=1 


k (4.36) 


POY) = D> Pr = k|yra-1)P Orle). 
k=1 


Where @,, represents the observation parameters for the trained HMM and the z).7 
is a state path over the hidden state space. As such, if the state path is estimated, 
the maximum probability of the observation sequence can be obtained. However, 
since the true state path is hidden from the observations. For the computational 
convenience, the general approach would be to use the maximum likelihood state at 
any given moment, but this would neglect uncertainty. Instead, we use a marginal 
probabilistic representation over hidden states at each time step (see details in hidden 
state estimation of auto-regressive model section). Thus, the log-probability at time 
t is derived by computing the logarithm of the sum of exponentials over all hidden 
states 


= a; (k)- B;(k) 
=l ———— _], 4.37 
2 i ( P(yo:r) ) eel 
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where œ; (k) = p(z; = kl yo-r), Br(K) = POr+i:7 |Z), are presented the forward mes- 
sage passing and backward message in the standard forward-backward algorithm, 
respectively. 


1 
a, (k) = ————~ p(y, |zr = k) p(Zt = Kl 11-1), 
P(Yt\Yo:t-1) 


P = k|yor1) = È Plz = A) PEY). 
j 


(4.38) 


The denominator represents the probability of a sequence observations, which can 
be calculated by Eq. (4.36). Assume that we recorded N sequential data for each skill 
and the length of sequence n € N is T,. Thus, we can get a set of log-probabilities 
vector of each skill 


AN ee eo LA): (4.39) 


The above implies that the forward—backward procedure can be used to find the 
most likely state for any moment t. It cannot, however, be used to find the most likely 
sequence of states. 


e (C) Hidden State Estimation of Auto-regressive Model 


For offline constructing the state and log-likelihood pairs, we are interested in the 
most likely state sequence upon an observation sequence yo:r. One might be derived 
from copulating the MAP sequence by maximizing the marginal probability at each 
observation i 

2; = arg max p(zilyo:r)- (4.40) 


Zi 


We note this combined single observation estimated sequence might yield unlikely 
(even 0-probability) sequences. We therefore consider 


Zr = arg max p(Z1:7, yo:r)- (4.41) 


ZIET 


where, p(Z1:7, Yo:r) was computed in Eq. (4.36). Actually, we note the most likely 
state sequence that minimum the cost path to z; is equivalent to the minimum cost 
path to node z,_; plus the cost of a transition from z;_; to zz, and the cost incurred 
by observation yz. 

For convenience, we abbreviate the joint log-likelihood function Z (zı:r) = 
log p(Z1:r, Yo:r), and the transition probability z,,_, £ z,_1, then 


T 


Ly (ur) = 8121) + > SEn Ta), (4.42) 


t=2 


where 
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81(Z1) = log (mz, pO1|8z,, Yo), t=1 


(4.43) 
Si (Zr, Zr-1) = log (p(Zi|Z:-1) POF, Yr-1)), t22 
Obviously, 4 (Z1:) = 2-1 (Z111) + St (Zr, Z—-1), and set 
vi (z) = max ZL, (Z1:1), Z € {1,2,3,..., K} 
£1:t-1 (4.44) 


Uy (Z1) = max {U;—1 (Zr) + Str, Z1 
Zt 


From the above, The maximizing state sequence is iteratively obtained, for n € 
{T —1,T —2,...,1} 


Zr = arg max Upn(Z-1); (4.45) 
Zt-1 


Assume that we recorded N sequential data for each skill and the length of sequence 
n € N is T,. Thus, we can get a set of the most probable states vector of each skill 


BEN S ee eee ae (4.46) 


e (D) Threshold Calculation 


Having computed the joint probability of the observations and the most probable 
states for each skill s. we can concatenate them to compute the threshold of specific 
state z = k, that is p(z). Each state consists of its maximum a and minimum 
a ” We cluster the log-likelihoods with the same state, which represents the clus- 
tering data by HMM belong to the same statistical distribution. 


ee = max L"1(z,==k), 


ie{1,...,N} 
Ze = min 2 Nee == k), (4.47) 
k ie{l,..., N} 


2 = L = gmn, 


where the notation 1(7) to represent an indicator random variable that is 1 if the 
event 7 occurs and 0, otherwise. Using Eq. (4.47), we can set the anomaly monitoring 
threshold of each active state z, at any moment according to: 


range 


PE) =L" EE. (4.48) 


In testing implementation, we preprocessed the incoming multimodal data in the 
same scheme as the training process and then get the most possible state Zest by 
recursively compute the filtered marginals and the corresponding log-likelihood est 
using Eqs. (4.38) and (4.37) respectively. 


Zest = arg max {p(z; = llyox),..., Ps = Kl yor)}. (4.49) 
1 K 


jas 
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Here, this approach detects an anomaly when the log-likelihood is lower than the 
state dependent threshold p(Zrest) This approach also extends to online detection. 
Given the trained sHDP-VAR-HMM and streaming sensor data, the detector can 
recursively compute the z; and %4 at time t, and then perform the comparison, if 
Lı < (Zest), then anomaly, else nominal. 


4.6.3 Experiments and Results 


e Finite State Machine Based Kitting Experiment 


In this section, we describe the kitting experimental setup, external disturbances, 
data collection, and anomaly monitoring and diagnoses model parameters. 

A kitting experiment is setup, where a robot and a human co-worker closely 
collaborate to place a set of goods in a packaging box. Specifically, a human co- 
worker is tasked to place a set of 6 objects on the robot’s “collection bin” (located 
in front of the robot) in a one-at-a-time fashion. The objects may accumulate in a 
queue in front of the robot. As soon as the first object is placed on the table, the robot 
identifies the object and places it in a packaging box located to the right of the robot. 
Thus, the robot picks an object and transports it towards the box, after which, the 
robot appropriately places each of the six objects in different parts of the box. The 
whole implementation is shown in Fig. 4.11. 

The kitting experiment consists of 6 skills (Skill 3) : Home — Pre-pick; (Skill 
4) : Pre-pick — Pick; (Skill 5): Pick —> Pre-pick; (Skill 7) : Pre-pick — Pre-place; 
(Skill 8) : Pre-place — Place; (Skill 9) : Place — Pre-place that were effectively 


Velocity 


Tactile 


Sensors are mounted 
on the right arm 


Fig. 4.11 A kitting experiment consists of 6 skills (in various color) that were modeled by DMP, 
respectively. Objects that need to be packaged are placed by a human collaborator before the robot 
in acollection bin. The shared workspace affords possibilities for accidental contact and unexpected 
alteration of the environment. The robot is tasked to pick each one of the objects and place them in 
the packaging box to its right. The visualization module uses the ALVAR tags to provide a consistent 
global pose with respect to the base of the robot. Ten nominal executions and one snapshot of each 
skill are shown 
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modeled by the dynamic movement primitives and the ROS-SMACH.! One-shot 
kinesthetic demonstrations were used to train DMP models for each of the skills of 
the kitting experiment. Note that a skill’s starting and ending pose can be adapted 
if necessary thus providing a flexible and generalizable skill encoding procedure. 
The trajectory adaptations however increase the difficulty of assessing the sensory 
data for both anomaly monitoring and diagnoses—due to the variability in the data 
collection even from nominal executions. 


e The Robot, Ojects, and Visual System 


A Baxter humanoid robot’s right arm is used to pick commonplace objects set before 
him. The robot is equipped with a 6 DoF Robotiq FT sensor, 2 Baxter-standard 
electric pinching fingers, as well as Baxter’s left hand camera. Each finger is further 
equipped with a multimodal tactile sensor composed of: (i) a 4 x 7 taxel matrix 
that yields absolute pressure values, (ii) a dynamic sensor which provides a single 
capacitive reading in millivolts (mV) useful to detect tactile events, and (iii) an IMU 
and gyroscope [34]. Baxter’s left hand camera is placed flexibly in a region that can 
capture objects in the collection bin with a resolution of 1280 x 800 at 1 fps (we are 
optimizing pose accuracy and lower computational complexity in the system). The 
use of the left hand camera facilitated calibration and object tracking accuracy. 

A set of 6 common household objects consisting of box-like shapes and bottles 
were used in our work. The objects ranged in weight from 0.0308 to 1.055kg and 
in volume from 3.2 x 10704m? to 1 x 107° m3. The object’s surfaces also varied 
slightly: some heavier objects had sleeker surfaces to incite object slips—we believe 
not an unreasonable determination as warehouses contain a wide variety of objects— 
whilst other objects had rougher surfaces. Across executions, object locations and 
order was varied to promote generalization. 

Alvar tags,” with 0.06 m sides, were placed around the circumference of the objects 
for robust visual recognition (ALVAR can handle change in lighting conditions, 
optical flow-based tracking, and good performance for multitag scenarios) regardless 
of orientation. 


e External Disturbances 


External disturbances may be introduced into a system for a variety of reasons. We 
list possible scenarios in which disturbances may occur in real-systems, as shown in 
Fig. 4.12: 


(1) For collaborative jobs (such as kitting in warehouses, see the description in 
experiment setup section), humans may experience monotony leading to bore- 
dom and loss-of-attention. In such cases, it is possible for a human co-worker 
to accidentally collide with the robot manipulator or alter the environment in 
unexpected ways. 


“http://www.ros.org/SMACH. 
2http:// wiki.ros.org/ar_track_alvar. 
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(e) WC (f) NO (g) OTHER 


Fig. 4.12 Seven representative unexpected anomalies in the kitting experiment (left-to-right): 
Human Collision (HC), Tool Collision (TC), Object Slip (OS), Human Collision with Object (HCO), 
Wall Collision (WC), No Object (NO), Other (OTHER) 


(2) The user may also accidentally collide or unintentional move packaging objects 
in ways a robot may not anticipate. Such variations in the world may lead to 
tool-collisions. 

(3) Picked objects may slip from a robot’s gripper if the grasp is not optimal; or if 
upon motion, inertial forces acting on the object cause slippage that breaks the 
grasp. 

(4) Object slips may also be caused due to human collisions. Such phenomena 
depicts a chain of anomaly reactions where one anomaly leads to another. 

(5) The robot may also collide with packaging box during kitting. 

(6) Missed-grasps (where the robot pinches air) are possible when the target object 
moved without the robot’s notice. 

(7) Additionally, it is possible to suffer from false-positives from the anomaly detec- 
tor. False-positives may occur for numerous reasons: system errors, unreachable 
objects, unfeasible inverse kinematic solutions, unidentifiable objects from the 
visual system, to name a few. 


The seven types of anomalies we just described will be declared more succinctly 
in the rest of the paper as: Human-Collisions (HC), Tool-Collisions (TC), Object 
Slips (OS), Human-Collision-with-Object (HCO), Wall-Collision (WC), No-Object 
(NO), and Other (OTHER) respectively. In Sect. 4.6.1, the introspection methodology 
used to model robot skills, continued by a presentation of anomaly monitoring in 
Sect. 4.6.2 and anomaly diagnoses in Chap. 5. 
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e Dataset Collection and Sensory Preprocessing 


It is undeniable that bias exists on the side of robot researchers as to which anomalies 
may exist in the task. In this work, we first attempt to discover likely anomalies in this 
task by emulating a collaborative kitting task in which the human collaborator expe- 
riences tedium and monotony resulting in undesired events that lead to anomalous 
events. 

We tasked 5 participants—whose ages range from 23 to 28—to act as a collabo- 
rator in the kitting task under the monotonous conditions mentioned in experiment 
setup section. One expert user and four novice users participated in the experiments. 
Novice users learned from the expert to induce anomalies by observing a round of 
executions. Each participant performed 1 nominal and 6 anomalous executions by 
placing the set of six household objects in a one-by-one fashion. For training, when 
an anomaly occurs, the robot execution is halted and the data stream is labelled with 
an anomaly class (namely the labels with HC, TC, OS, NO, etc.) for supervised clas- 
sification during testing. Subsequently, anomalous time-stamps are also extracted. 
In total, we collected a data set with 18 nominal executions and 180 anomalous exe- 
cutions (each anomalous trial has at least one anomaly across the entire task), and 
38 failure trails are ignored. 

As for our multimodal observation vector, it is comprised of 6 DoF force-torque 
signals, 6 DoF Cartesian velocity signals (from Baxter’s right end-effector), and 
56 DoF tactile signals from both the left and right tactile sensors. Empirical feature 
extraction is performed on this observation vector to improve diagnoses performance. 
We now describe how features for each modality at time t were engineered. 
Wrench modality: Assume the force and torque sequence is a time-series vector and 
that each element represents the magnitude of each dimension (fx, fy, fz» tx, ty, tz). 
We wish to extract structural information from the wrench instead of simply using 
raw values. In this way, we can find signal patterns that may occur across the different 
DoFs. To this end, we computed the norm of both the force n ¢ and the torque n, as 
features respectively: 


Reali, ties Maye Eee: (4.50) 


Velocity modality: Similarly, we take the Cartesian linear (/,,/,,/,) and angular 
(ax, ay, az) velocities and compute their norm n; and na respectively: 


n = Rane, Ng = a? +a? Ge: (4.51) 


Tactile modality: Due to the computational cost of processing the tactile sensor’s 
high dimensionality, we empirically tested a number of features for each tactile 
sensor, they include: the maximum taxel value, the largest five taxel values, the mean 
of all taxel values, and the standard deviation for all taxel values. It was the standard 
deviation, which proved to be the most useful feature for anomaly monitoring and 
diagnoses. The standard deviation for each tactile sensor s; and s, are defined as: 
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y= N li — me = li — : 4.52 

r= 2 i=) S = | dt Hr) (4.52) 


where u; = x ye l; and u, = x ye r; are the mean of each tactile sensor 
respectively. Equations (4.50)-(4.52) facilitate the identification of diverging signals 
during anomalous situations. We empirically concatenate all the valuable features 
for representing the robot executions both in nominal and anomalous cases. If raw 
signals are used directly, they result in high false-positive detection rates (FPR) in 
our experimentation. For instance, when a robot carries objects of varying weights, 
the F/T signals for such operations vary significantly making it difficult to use raw 
values for detection. Hence, a standardization method is used to scale the extracted 
features through a normalization constant. We ultimately end with an feature vector 
of length 18 as shown below: 


ik fh fot th k np n nı So Sp 
ye = ’ $ ’ ’ $ $ ’ s lx, ly, lz, ax, Ay, Az, s Ma, ’ * 
10 10 10 10 10 10 10 0.4 : 0.4 250 250 
(4.53) 
Note, that all training and testing data are pre-processed in the same manner. We 
will also use the same pre-processing approach when comparing with other baseline 
approaches. 


e Basic Parameter Settings of Each Model 


For anomaly monitoring and diagnoses using the sHDP-VAR-HMM with a first-order 
autoregressive Gaussian likelihood, each state k has two parameters to be defined: 
the regression coefficient matrix A; and the precision matrix A+. The conjugate prior 
is the MNIW for which four parameters v, A, V, M are assigned values in a fashion 
similar to the Motion-Capture-Dataset in [28]. In order to guarantee the prior has 
a valid mean, the degrees-of-freedom variable is set with v = d + 2 and we set A 
by constraining the mean or expectation of the covariance matrix E[ A, 1] under the 
Wishart prior in Eq. (4.30). 


NT, 


1A, = sr Y Y O-o. (4.54) 


=l t=1 


Assume that we record N sequential data for each skill and the length of sequence 
n € N is T,. Thus, we can easily define the parameter A accordingly as 


A= w-d — DELA; ']. (4.55) 


Specifically, we placed a nominal prior on the mean parameter with mean equal to 
the empirical mean and expected covariance equal to a scalar sp times the empirical 
covariance, here sr = 1.0. This setting is motivated by the fact that the covariance 
is computed from polling all of the data and it tends to overestimate mode-specific 
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covariances. A value slightly less than or equal to 1 of the constant in the scale 
matrix mitigates the overestimation. Also, setting the prior from the data can move 
the distribution mass to reasonable parameter space values. The mean matrix M and V 
are set such that the mass of the Matrix-Normal distribution is centered around stable 
dynamic matrices while allowing variability in the matrix values (see Sect. 4.6.1 for 
details). 

M = 04, 


(4.56) 
V = 1.0 > ių. 

where 0, and I, are the matrix of all zeros and the identity matrix, respectively, each 
of size d x d. For the concentration parameters, a Gamma(a, b) prior is set on HDP 
concentration parameters œ. A Beta(c, d) prior is set on the self-transition parameter 
u and the degree of self-transition bias x is set to 50. We choose a weekly informative 
setting and choose: a = 0.5, b = 5, c = 1,d = 5. Specifically, the initial transition 
proportion parameter is set ų ~ Beta(1, 10). The Split-Merge Monte Carlo method 
sampling parameter maximum iterations is set to 1000. The truncation states number 
is set to K = 5 for anomaly monitoring, and K = 10 for anomaly diagnoses. Then, 
the HDPHMM model of each anomaly class or nominal skill is trained. Note, that as 
with pre-processing, we also use the same model parameters in training and testing 
as well as in other baseline approaches. 


e Results and Analysis of Anomaly Monitoring in Kitting Experiment 


According to Sect. 4.6.2, the anomaly detector implementation consists of represent- 
ing multimodal observations through latent states, concatenating the derived log- 
likelihood values of a latent state, and calculating the anomaly monitoring threshold. 

We first generate the latent state partitions for nominal demonstrations with vary- 
ing speeds and goals. Notice how the non-parametric method yields varying dynamics 
across executions reflecting statistical variations in the robot motion trajectory due 
to changes in the desired trajectory completion time and varying goal distances. 

Then, latent state log-likelihood values are concatenated and the corresponding 
thresholds computed. To this end, we visualized the performance of the proposed 
method as well as a baseline by assessing the anomaly time reaction of anomaly 
flags; that is, which algorithm can make a correct detection at a time closer to the 
ground truth. For training, ten nominal executions were used along with 3-fold cross 
validation to obtain nominal sHDP-VAR-HMM models for each skill. The other 20 
nominal executions were used for testing. 

The execution varying threshold for anomaly monitoring was set according to 
Eq. (4.48). The results for skill 3 are visualized in Fig.4.13. As seen in Fig. 4.13, 
latent state zhat = 2 is the most prominent latent state (the majority of the observa- 
tions in the skill are ascribed to this state). We now analyze the time sensitivity of our 
system in detecting anomalies. Anomaly identification sensitivity is intimately con- 
nected to threshold setting. We compare the results of our execution varying thresh- 
old with a fixed threshold presented in [35]. The comparison is seen in Fig. 4.14. 
Under nominal situations both methods have the same robustness in avoiding false- 
positives. In anomalous situations, there are multiple occasions in which our varying 
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Fig. 4.13 All the log-likelihood values of individual hidden state of skill 3 when the truncation 
states number is set to K = 5. Each subplot shows the mean value (black dotted line) and the 
corresponding log-likelihood threshold value (red solid line) in Eq. (4.48) 


threshold yielded better time sensitivity and detection accuracy. From Fig. 4.14, we 
see that for executions 2 and 4, our varying threshold detector identified anoma- 
lies that the fixed threshold method could not. Notice that the distance between 
the anomaly_t_by_human and the anomaly_t_by_detector is larger for the fixed 
threshold (recall from the dataset collection section that ground truth time stamps 
were obtained during training). This improved sensitivity reduces false-negatives 
and helps us flag anomalies at times closer to the ground truth. 

Finally, Table 4.5 shows the anomaly monitoring performance for our hidden- 
state detector (HSD) and the baseline gradient-based detector (GD) for the six skills 
used in the kitting task for 368 nominal executions and 136 anomalous executions. 
Our results indicate that while the GD system had the HSD resulted in higher 16.6% 
better precision, our system had 76.0% better recall. The difference is significant in 
reducing false-negatives and significantly raising the sensitivity of our system. The 
sensitivity improvement is almost four times bigger than the difference in precision. 
For this reason, our system reflects an Fl-score that is 25.0% larger and an accuracy 
that is Fl-score, and an accuracy that is 5.7% higher standing at 91.0%. 
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Fig. 4.14 The comparison of two anomaly detectors for five robot anomalous executions. Each 
row represents the same testing execution. Left column: results of our hidden state-based anomaly 
detector. Right column: results of the gradient-based anomaly detector [21]. The ground truth 
(anomaly_t_by_human) are represented by green dashed lines and the detected anomalies are 
represented by yellow solid lines 


4.6.4 Discussion 


Comprehensive experimental results showed robust, sensitive, and better timed 
anomaly monitoring for a wide range of anomalous situations in an unstructured 
co-bot scenario where a human and a robot collaborated to complete kitting tasks. 
The improved anomaly detector exploits the mapping relationship between latent 
states and the marginal log-likelihood values from the sHDP-VAR-HMM for mon- 
itoring anomalies. Our evaluations showed that we could detect anomalies reliably 
(overall accuracy of 91.0%). 
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Table 4.5 The comparison of anomaly monitoring performance measures between our proposed 
method and a baseline method 


Kitting experiment Testing information Precision Recall Fl-score Accuracy 

Movements Success Anomalies |GD |HSD |GD |HSD |GD_ | HSD | GD HSD 
Home — Pre-prick 141 6 1.0 1.0 0.33 | 0.50 | 0.50 | 0.67 | 0.973 | 0.980 
Pre-prick — Prick 112 23 1.0 1.0 0.78 | 0.78 | 0.878 | 0.88 | 0.963 | 0.963 
Prick — Pre-prick 83 28 1.0 0.80 | 0.04 | 0.31 | 0.069 | 0.45 | 0.757 | 0.820 
Pre-prick — Pre-place 13 70 1.0 0.84 |0.54 |0.97 | 0.704 | 0.90 |0.614 | 0.843 
Pre-place — Place 11 6 1.0 0.40 | 1.0 1.0 1.0 0.57 | 1.0 0.769 
Place — Pre-Place 8 3 0.0 0.0 0.0 0.0 0.0 0.0 0.889 | 0.889 
Across all skills 368 136 0.98 | 0.84 | 0.47 | 0.76 | 0.64 | 0.80 | 0.861 | 0.910 


With regards to anomaly monitoring, Table 4.5 indicated that our proposed detec- 
tor achieved unexpectedly superior recall measurements compared to the baseline 
(the gradient-based method), an improvement of 61.7%. By adjusting the anomaly 
monitoring threshold during the task’s progress execution task as a function of the 
maximum and minimum of the log-likelihood of observations for latent states, we 
gained enhanced sensitivity. Human-centric safety approaches dictate that increased 
sensitivity is highly desirable in human-robot collaboration tasks so as to provide 
higher safety guarantees for a human. Robotic workcells on the other hand may 
work better with approaches like the gradient-based detector that was less sensitive 
but had higher precision. Here the reduction in the false-positive rate may support 
longer-term autonomy in a workcell. The anomaly detector exhibited not only a bet- 
ter accuracy but also more accurately timed anomaly flags. That is, our anomaly 
flags occurred closer to the ground truth time. This would be useful. Timeliness may 
prove a critical role in cases where anomalies are triggered as a chain reaction. That 
is anomalies that cause more anomalies. By acquiring more timely detection, we 
may be able to also provide more timely recovery techniques or even prevention so 
as to minimize the possibility of a cascade of anomalies. 


4.7 Summary 


In this chapter, there are three kinds of anomaly monitoring methods in various 
threshold are presented based on the Bayesian nonparametric HMMs, namely: log- 
likelihood-based threshold, threshold based on the gradient of log-likelihood, and 
computing the threshold by mapping latent state to log-likelihood. All proposed 
methods and implementation processes were validated on the experimental platforms 
of “HIRO-NX robot performs electronic component assembly tasks” and “Baxter 
robot performs object packing tasks” using the ROC curve or accuracy-precision- 
recall rate-F1 score as a performance evaluation indicator. The results are as follows: 


(1) The anomaly monitoring method using log-likelihood-based threshold can effec- 
tively realize the online monitoring, but there are a large number of false pos- 
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(2) 


(3) 
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itive (false triggering) and has an uncertain constant coefficient c. For the task 
of HIRO-NX robot performing the electronic components assembly, sHDP- 
VAR(2)-HMM obtains the optimal performance where the constant value of the 
threshold is c = 3 and the observation model is the second-order linear Gaus- 
sian regression model, resulting in a false positive rate of about 5.0% and a true 
positive rate of about 95.0%. 

The method of threshold based on the gradient of log-likelihood can effectively 
solve the problem of false triggering, but resulting in low sensitivity and moni- 
toring success rate with a fixed lower threshold. For the task of Baxter robot to 
performing object kitting experiment, we obtain the average monitoring accu- 
racy of a total of 67 anomaly events is 93.8%, far better than the 59.2% of the 
log-likelihood-based threshold. 

The anomaly monitoring method based on computing the threshold by mapping 
latent state to log-likelihood uses the dynamic threshold to solve the problem 
of false triggering, low sensitivity and success rate, but the dynamic threshold 
depends on the inference of the belief state, there will be a rapid transition 
phenomenon, resulting in a higher risk of false positive rate. For the task of 
Baxter robot to performing object kitting experiment, resulting in the average 
monitoring accuracy of a total of 136 abnormal events in the is 91.0%, which 
is much better than the 86.0% of the threshold based on the gradient of log- 
likelihood. 
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Chapter 5 A) 
Nonparametric Bayesian Method ciecie; 
for Robot Anomaly Diagnose 


Abstract In this chapter, we introduce two novel anomaly diagnose methods using 
the Bayesian nonparametric hidden Markov models when anomaly triggered, includ- 
ing i)multi-class classifier based on nonparametric models, ii) sparse representation 
by statistical feature extraction for anomaly diagnose. Additionally, the detail pro- 
cedure for anomaly sample definition, the supervised learning dataset collection as 
well as the data augmentation of insufficient samples are also declared. We evaluated 
the proposed methods with a multi-step human-robot collaboration objects kitting 
task on Baxter robot, the performance and results are presented of each method 
respectively. 


5.1 Introduction 


In addition, anomaly diagnoses refers to the process of clearly classifying anoma- 
lies using supervised learning on the premise that the robot detects the occurrence 
of anomalies. Due to the uncertainty, diversity and unpredictability of anomalies, 
the robot’s anomaly diagnoses can effectively improve the safety performance of 
the robot system and provide guarantee for subsequent anomaly repair behaviors. 
Therefore, the online abnormal monitoring and diagnoses of robots is the focus and 
difficulty of robots’ long-term autonomous operation. 


5.2 Related Work 


The robot introspection not only needs to implement the movement identification and 
monitoring of anomalies, but also to diagnose the types of anomalies for providing 
sufficient support in subsequent recovery. The samples used for anomaly diagnoses 
in this chapter would have the same modal information as in anomaly monitor- 
ing and belong to multidimensional time series. Since the problem of multidimen- 
sional time series classification is a supervised learning problem [15-17], which 
aims to determine the labels of multidimensional time series of the same structure 
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and length, Orsenigo et al. [18] proposed a discrete support vector machine (Discrete 
Support Vector Machine) Machine (DS VM) classification method. This method ben- 
efits from the concepts of warping distance and softened variable margin and realizes 
the classification of multidimensional time series. Lee et al. [19] proposed a time 
series classification method based on K-Nearest Neighbor [20] (K-Nearest Neigh- 
bor, KNN), which was successfully used to evaluate the traffic prediction application 
of the mobile telecommunications industry. Seto et al. [21] combined dynamic time 
warping [8, 22] (DTW) and KNN to derive a multivariate time series classification 
based on dynamic time warping template selection, and applied it to human activity 
recognition. In addition to the above-mentioned distance-based multidimensional 
time series classification method, the method of establishing feature vectors through 
dimensionality reduction methods has also received attention [23—25]. Nanopoulos et 
al. [26] proposed a method based on statistical feature extraction to establish feature 
vectors, and then trained a multi-layer perceptron (MLP) from the feature vectors and 
target categories to achieve the classification of time series. Utomo et al. [27] pro- 
posed to classify the multidimensional data of the hospital intensive care unit by the 
method of multidimensional compression description (MultiCoRe), which extracted 
the features including time domain and frequency domain for classification. Jaakkola 
et al. [28] introduced a method combining HMM and SVM for classifying protein 
domains. This method can also be applied to other fields of biological sequence 
analysis. Raman et al. [29] proposed the use of a multi-layer HDP-HMM modeling 
method to achieve classification of human movements, training HDP-HMM from 
all movement motion samples, and building a multi-objective classifier from this 
model. Test the size of the log-likelihood function of the specimen to achieve classi- 
fication. Dilello et al. [30] used sHDP-HMM to classify robots with multiple modal 
anomalies. The model assumes that the observations are independent of each other, 
which weakens the correlation between abnormal data to some extent. Although the 
training of neural networks with small samples is easy to cause overfitting and diffi- 
cult to achieve online monitoring, but because of its better modeling capabilities and 
less data pre-processing process, multivariate time based on deep learning Sequence 
classification is currently being extensively studied by scholars [31-34]. 

Besides the aforementioned that multivariate time series classification has received 
significant interest in areas such as healthcare, object recognition, and human action 
recognition [35-37]. Meanwhile, most of the solutions to the classification problem 
of multidimensional time series are solved by supervised learning, that is, the multi- 
objective classifier learning is performed by manually labeled samples in advance by 
humans. In view of this, the anomaly diagnoses problem considered in this chapter 
is mainly for the classification of conventional anomaly samples with multimodel 
observation of robotic systems. In robotics, complex semi-autonomous systems that 
have provided extensive assistance to humans, anomalies are occur occasionally [39]. 
For this reason, enabling accurate, robust, and fast anomaly monitoring and diag- 
noses in robots has the potential to enable more effective, safer, and more autonomous 
systems [38, 49]. Execution monitoring, especially with the focus of detecting and 
classifying anomalous executions has been well studied in robotics [40]. Daehyung 
Park et al.[39] introduced a multimodal execution monitoring system to detect and 
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classify anomalous executions for robot-assisted feeding. Specifically, the classifier 
labelled not only the anomaly type but also the cause of the anomaly by using an 
artificial neural network system. The neural net successfully integrated the anomaly 
monitoring and diagnoses for assisting a person with severe quadriplegia. However, 
due to the HMM anomaly detector limitations, the classified accuracy for anomaly 
types and causes are 90.0%, 54.0% respectively. In [41], Bjreland et al.introduced 
a monitoring system that also can detect, classify, and correct anomalous behaviors 
using a predictive model. Those two integrated system give us a lot of inspiration 
and confidence for improving the robot introspection with the anomaly monitoring 
and diagnoses. 

The anomaly diagnoses has been used to determine the source of anomalies while 
running manipulator or mobile robots [42]. Several common time series classifica- 
tion algorithms are distance based metrics, such as the k-nearest neighbors (KNN) 
approach have proven to be successful in classifying multivariate time series [43]. 
Plenty of research indicates Dynamic Time Warping (DTW) as the best distance 
based measure to use along KNN [44]. Besides the distance based metrics, the fea- 
ture based algorithms have been used over the years [45], which rely heavily on the 
features being extracted from the time series data or modeling the time series with the 
parametric methods. Hidden State Conditional Random Field (HCRF) and Hidden 
Unit Logistic Model (HULM) are two successful feature based algorithms that have 
led to state of the art results on various benchmark datasets [46]. However, HCRF is a 
high computational efficiency algorithm that detects latent structures of the input time 
series using a chain of k-nominal latent variables. Further, the number of parameters 
is linearly increasing of latent states required. To overcome this, HULM proposes 
using H binary stochastic hidden units to model 2” latent structures of the data with 
only O(#) parameters. Another approach for multivariate time series classification 
is by applying dimensional reduction techniques or by concatenating all dimensions 
of a multivariate time series into a univariate time series [47]. 


5.3 Problem Statement 


Another main task of robotic introspection is to accurately evaluate the potential 
pattern of real-time multi-modal sensing data. It is not limited to the identification 
of robot movement and abnormal monitoring, but also has the ability to classify 
conventional abnormalities. Anomaly diagnoses mainly refers to the identification 
of conventional learned abnormal types through supervised learning after the robot 
detects the occurrence of abnormalities. An intelligent robot system includes but is 
not limited to the following types of abnormalities: system abnormalities, internal 
sensors anomalies, motion instruction failures, noise or damage to sensors and actua- 
tors, robot-human-environment interactions, or external disturbances in the environ- 
ment. Park et al. [50] carried out a diagnoses of the types of robot anomalies and their 
causes, and integrated this function into a human-robot interactive robot system for 
feeding disabled people. The types of anomalies considered in this section mainly 
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come from external disturbances caused by system anomalies (kinematic anoma- 
lies, sensor failures, etc.) or human improper interference behavior or robot ends in 
a human-machine collaboration environment. In order to improve the compactness 
and portability of the proposed robotic sensing system, this section still adopts the 
method of non-parametric Bayesian hidden Markov model to implement multi-class 
anomaly diagnoses. 


5.4 Collection and Augmentation of Anomaly Sample 


The collection of anomalous samples in this book takes into account the precursory 
period and the duration of a certain period of time when the anomaly occurs, which 
is determined by the value of the given anomalous window range win_len (for ease 
of adjustment, the duration of the precursory period and duration is usually equal, 
both are half of win_len). As shown in Fig.5.1, a sample with an anomaly type 
of “tool collision” is extracted (light red background area) around the trigger time 
(black vertical dashed line) of the previous anomaly monitor. 

In addition, due to the great randomness and uncertainty of the occurrence of 
anomalies, the number of samples between different types of anomalies is extremely 
unbalanced, and for some anomalies (collision between the robot and the environ- 
ment), the impact on the robot body may even cause damage There is no guarantee 
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Fig. 5.1 An example of extracting an “tool collision” anomaly sample with a given when anomaly 
detected 
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Fig. 5.2 The implementation flowchart of K-means Algorithm 


for the number of samples for model training. In view of this problem, this article 
considers the data enhancement processing for the abnormal types of fewer sam- 
ples collected. The main goal is to generate synthetic data based on the statistical 
characteristics of the data. 

The proposed multimodal augmentation method is inspired by the K — means 
clustering method in traditional machine learning, combined with Dynamic Time 
Warping (DTW) to enhance the data. Among them, the implementation flowchart of 
the K — means clustering method is shown in Fig. 5.2. 

In the algorithm, K represents the number of categories, Means represents the 
mean. As the name suggests, this method is a method for Dimensional or multidi- 
mensional data points, and even the time series considered in this article) clustering 
algorithm, the core idea is to use a preset K value and the initial centroid of each 
category (Centroid) to distance (Euclidean distance, Manhattan distance and Time 
series distance measure (DTW) is used to cluster the data points that are closer, so 
that the average value of the clustered iteration is preferentially obtained with the 
smallest objective function value, as shown in the following formula: 


K N 
J= SOO Ik? = cjl (5.1) 


where, the symbol J is the objective function of the K — means algorithm; K is the 
number of clusters; N is the number of data points; || - || is expressed as the distance 
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Algorithm : Data Augmentation for Sparse Anomaly Class 
Input: {y,,}N°: Sparse dataset with N. data-points (sequences), cach of length Th; 
N: Number of data-points for class c € C; 
Np: For determining the number of centroids for K-means clustering 
Ny: Number of the iteration of K-means; 
Na: Number of runs for DTW averaging algorithm; 
K: Number of centroids for K-means clustering 
Result: Synthetic anomaly sample gencration: {ja}, N, >N, 
1 K & ceil(|N|/N,). ceil(x): The smallest integer greater than or equal to x 
2 while Ng > 0 do 
3 CENTROIDS +- pick K random data points from {yn }°; 
4 Assign each data point yn to the nearest CENTROID using DTW; 
5 Remove CENTROIDS with only one data-point; 
6 for CENT ROID in CENTROIDS do 
7 | CENTROID +- update the centroid using the subset of data-points which are assigned to centroid; 
8 
9 


end 

{9.}N* + append the CENTROIDS 
10 Nk 1 
u end 


Fig. 5.3 The pseudocode for data augmentation for sparse anomaly class 


function between the data points and the centroid; and c represents the centroid of 
the category. From the implementation of the K — means algorithm, we know that 
by clustering data points close to the centroid, and data points in the same category 
have similar statistical characteristics, the goal of satisfying data enhancement is to 
hope to generate similar data based on sparse dataset New data for characteristics. 
To this end, the proposed algorithm, after initializing K random centroids, loops 
through the following two steps: (1) Dividing. Divide each data point to the nearest 
centroid according to the distance measurement of DTW; (2) Update. According 
to the selected mean measurement method, the centroid is moved to the center of 
various types of designated data points. This time, each time the updated centroid is 
used to generate a composite data, the algorithm’s pseudo code is shown in Fig. 5.3. 

Meanwhile, an example of one-dimensional data enhancement for an abnormal 
type is shown in Fig. 5.4. As can be seen from Fig. 5.4, the proposed algorithm 
can effectively capture the statistical (such as peak and mean) characteristics of the 
original data. 


5.5 Anomaly Diagnose Based on Nonparametric Bayesian 
Model 


The anomaly diagnoses is triggered once an anomaly is detected. A system can 
possibly address a wide variety of types of anomalies including low-level hardware 
anomalies: sensor and actuator noise or breakage; mid-level software contingen- 
cies like: logic errors or run-time exceptions; high-level misrepresentations: poor 
modeling of the robot, the world, their interactions, or external disturbances in the 
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Fig. 5.4 An example for illustrating the data augmentation of anomaly sample 


environment). In [48, 50], Park et al.identified both the anomaly class and the cause. 
In this work, we deal with anomalies caused by external disturbances generated either 
by intrusive human behavior or resulting from poor modeling or anticipatory ability 
on the robot’s end. 


5.5.1 Mutlticlass Classifier Modeling 


We consider the problem of multiclass classification based on multivariate time series, 
that is, to find a function f(y") = c” given the i.i.d training data Y = Sig are C= 
{c"}A], where y” = [y8, cus yr,] is an observation sequence and c” € {1, .., c} 
its corresponding anomaly class. An observation yf € R’ consists of the same mul- 
timodal features as described in Sect. 4.6.1 at time-step t. Our objective is diagnoses, 
where given a new test observation sequence f, we have to predict its correspond- 
ing anomaly class ¢. Here, the } is recorded by considering a window_size around 
the anomaly occurred moment, and the ĉ is labeled manually during training proce- 
dure. We represent each anomaly class by a separate sHDP-VAR-HMM as outlined 
in Sect. 4.6.1, the ©, are the parameters for class c. It would be straightforward to 


estimate the posterior density of parameters p(@,)|Y, C) by 
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Algorithm 1: Training model for anomaly diagnosis 
Input: 
Ne: Number of sequences for class c € C; 
{Yn pe; Dataset with Ne sequences, each of length T),; 
N;: Number of the maximum iteration for learning; 
N,: Number of runs for the whole learning; 
random_state: The random number generator; 
k_splits: Number of folds; 
a,b,d,e: Hyper-prior for concentration parameters; 
v, A, V, M,sp: Hyper-prior for the MNIW distribution; 
x: The self-transition bias; 
K: The truncation active states. 
Result: sHDP-VAR-HMM models for each class 
{Yn } trains {Yn }test = KFold_split({ yn}, k_splits, random_state) 
for k in k_splits do 
for i in N, do 
for n in N; do 
if not converged then 
| Oe, loss = HDPHMM({ yp} train, a, b, d,e, v, A, V, M, Sp, K, K) 
else 
| sHDP-VAR-HMM with ©, and loss 
end 
end 
{loss} < loss 
if i N, then 
| ©, + with the minimum loss value 


end 


Ntest * ©, s 
Lisi Fy }il ) i € Neest 


k_splits 
{Leest_mean },” } < Lrest_mean 
if k == k_splits then 
| return ©, with the maximum Lies) mean 


Leest „mean — 


end 


Fig. 5.5 An example for illustrating the progress for training nonparametric Bayesian model for 
anomaly diagnoses 


P(@cl¥, C) = p(@-) [| p(y"1@c) po). (5.2) 


c"=c 


That is, each sHDP-VAR-HMM is trained separately and a conditional density p(y|c) 
for each class is trained throughout the process as defined in Fig. 5.5. 
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5.5.2 Experiments and Results 


e Anomaly Dataset in Kitting Experiment 


The dataset captures sensory-motor signatures of the Kitting task under anomalous 
scenarios as outlined in this paper. A total of 136 anomalous events were recorded in 
142 experimental executions across all skills. The proportions for each anomaly class 
were as follows: TC 15.7%, HC 16.7%, HCO 16.7%, NO 13.0%, OS 16.7%, WC 
15.7%, and OTHER 5.6%. To intuitively explain, analyze, and propose corresponding 
diagnose methods, all the anomalies identified using our proposed method are taken 
into further consideration. That is, we extract all the anomaly data points from each 
abnormal movement and concatenate them with labels, in which the anomalous data 
point is also restricted with the same features as considered in anomaly identification. 
Those identified anomalies are visualized via the t-distributed Stochastic Neighbor 
Embedding (t-SNE) method [51] and labelled manually, as shown in Fig. 5.6. 


e Anomaly Diagnoses Window Considerations and Online Recording 


Note that when an anomaly monitoring is flagged, we consider a window of time 
duration twindow_size for the subsequent anomaly diagnoses.' In cases where an 
anomaly is detected towards the beginning of a skill execution, and the duration of 
existing data prior to the detection is less than our window_size. In this work, we 
do not extract data for diagnoses as we deem a minimal presence of signals before 
and after the detection crucial for the diagnoses. The sensory-motor data collected 
at this stage, allows to build basic models of the anomalies described in Chap. 4. 

Our system recorded sensory data online. Upon detection of an anomaly, the time- 
step at which the flag occurred is recorded. Then, we record (online) the multimodal 
signatures (F/T, Cartesian velocity, and Tactile signals) before and after the anomaly 
(also referred to as the sample) according to the window_size. Signals were re- 
sampled at 10Hz to achieve temporal synchronization for all modalities and the 
further preprocessed as described in preprocessing step. 


e Parameter Settings and Results 


To avoid overfitting, we performed 3-fold cross validation for each anomaly type 
separately. An 18 dimensional feature vector was used as presented in Equation 
4.53. We compare against five baselines consisting of parametric HMMs (with a 
fixed number of hidden states), various observation models, and various variational 
inference methods. The training and testing dataset for all the diagnoses methods 
was the same in this comparison. 

(1) Parametric HMM Settings: We train 3 differing types of HMMs for each 
anomaly class. Each HMM uses four different numbers of hidden states K € 
{3,5, 7, 10} to train each class. We need to estimate the transition matrix, mean, 
and covariance parameters. To this end, K-Means++ [52] clusters the data and yields 


'We use the Redis database for this purpose (https://redis.io/) as rosbags can only be processed 
offline. 
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Fig. 5.6 Visualization of all the identified anomalies using t-SNE method. We heuristically label 
the clusters according to the analysis of unexpected anomalies and the data points density in such 
a kitting experiment 


initial estimates. During testing, we evaluate a test sample against all classes and 
select the class with the largest log-likelihood. The parametric HMMs result sum- 
mary is presented in Table 5.1. 

o HMM-Gauss-EM: A classifier based on a classical HMM with Gaussian obser- 
vations was trained independently for each anomaly class. The standard Baum-Welch 
Expectation Maximization algorithm [53] was used to learn the HMM model. 

o HMM-Gauss-VB: A classifier based on classical HMM with Gaussian obser- 
vation was trained independently for each anomaly class. The standard Variational 
Bayesian (VB) inference algorithm [54] was used to learn the HMM model. 

o HMM-AR-VB: A classifier based on a classical HMM with first-order auto- 
regressive Gaussian observations was trained independently for each anomaly class. 
The standard VB inference algorithm [54] was used to learn the HMM model. 

In conclusion, we find that for parametric HMMs (i) the best diagnoses accuracy 
rate was 95.7% when using 5-7 fixed states; (ii) Variational inference algorithms 
outperformed the standard EM algorithm; (iii) The Autoregressive observation model 
classified better than the Gaussian model due to it’s linear conditional nature; (iv) 
Parametric HMMs, in general, are less effective to jointly model the dynamics of the 
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robotic task. Therefore, we consider a Bayesian nonparametric verision of HMMs 
with a hierarchical prior which shares statistical strength across the training samples. 

(2) Nonparametric HMMs Setting: We also train 4 kinds of classifiers base on non- 
parametric HMMs, independently for each anomaly class. We specify the truncation 
number of states as K=10 as explained in Sect. 4.6.1. Comparing to the parametric 
HMMs, the actual number of hidden states of each anomaly class is automatically 
learned from data in non-parametric fashion. The same procedure during testing 
as described in parametric HMMs. Benefits from the automatic state inference with 
HDPHMM,, the auto-regressive correlation of the anomaly data, and the effective vari- 
ational inference techniques. The summary of diagnoses results of non-parametric 
HMMs are presented in Table 5.1. Those numbers in blue are denoted the optimal 
diagnoses accuracy of specific anomaly type across all the methods, respectively. 

o HDPHMM-Gauss-VB: A classifier based on HDPHMM with Gaussian obser- 
vation was trained independently for each anomaly class. The standard VB inference 
algorithm [55] was used to learn the HDPHMM model. Similiar to the method pro- 
posed in [57], instead of the blocked Gibbs sampling algorithm, we learn the model 
by variational Bayesian inference. 

o HDPHMM-Gauss-moVB: A classifier based on HDPHMM with Gaussian 
observation was trained independently for each anomaly class. A memoized online 
variational inference algorithm (moVB) [56] based on scalable adaptation of state 
complexity is used to learn this HDPHMM model. 

o HDPHMM-AR-VB: A classifier based on HDPHMM with first-order autore- 
gressive Gaussian observation was trained independently for each anomaly class. 
The standard VB inference algorithm [55] was used to learn the HDPHMM model. 

o HDPHMM-AR-moVB: Finally, our results were evaluated on the HDPHMM 
with first-order autoregressive Gaussian observation for each anomaly class. A mem- 
oized online variational inference algorithm (moVB) [56] based on scalable adapta- 
tion of state complexity was used to learn this HDPHMM model. 

Given that non-parametric sHDP- VAR-HMM learns the complexity of the model 
from the data, it produces more accurate models as is reflected by the higher diag- 
noses accuracies shown in Table5.2. The learned number of states for the different 
anomaly types is shown in Table5.3. Note that for equivalent parametric HMMs, a 
tedious model needs to be computed for each class to optimize the state cardinal- 
ity between types. The diagonal elements represent the number of points for which 
the predicted label is equal to the true label, while off-diagonal elements are those 
that are mislabeled by the classifier. It is evident from the confusion matrix that the 
diagnoses outperforms other baseline methods. 


5.5.3 Discussion 


The multiclass classifier that is flagged when an anomaly is detected, was also tested 
through the sHDP-VAR-HMM on seven anomalies and six baseline methods. Our 
evaluations showed that we could not only detect anomalies reliably (overall accuracy 
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Table 5.2 Confusion matrix for diagnoses results on kitting experiment dataset with seven anoma- 
lies by HDPHMM-AR-moVB method 


TC HC HCO NO OS WC FP 
TC 1.00 0.00 0.00 0.00 0.00 0.00 0.00 
HC 0.00 1.00 0.00 0.00 0.00 0.00 0.00 
HCO 0.00 0.00 0,95 0.00 0.05 0.00 0.00 
NO 0.00 0.07 0.00 0.93 0.00 0.00 0.00 
OS 0.00 0.00 0.00 0.00 1.00 0.00 0.00 
WC 0.00 0.00 0.00 0.00 0.00 1.00 0.00 
FP 0.00 0.00 0.00 0.00 0.00 0.00 1.00 


Table 5.3 The number of hidden states being active for different anomaly classes using HDPHMM- 
AR-moVB method. An active state is one in which at least one observation is assigned to this state 


TC HC HCO NO OS WC FP 
4 7 5 4 8 6 3 


of 91.0%, as presented in Chap. 4) but also classify them precisely (overall accuracy 
of 97.1%). 

With regards to anomaly diagnoses, anomalies usually occur from various sources 
such as sensing errors, control failures, or environmental changes. If a robot identifies 
the type, it may be beneficial to prevent or at least recover from the anomalous situa- 
tion. In our proposed diagnoses method, we trained the sHDP- VAR-HMM model for 
each anomaly type separately. To address this, Natraj Raman had proposed a signal 
discriminative HDP-HMM for all classes albeit with an extra level that is class spe- 
cific. Thus, an interesting future work direction consists in investigating a multilevel 
sHDP-VAR-HMM for all classes for multiclass classification and the extensions of 
the observation model by using higher order autoregressive Gaussian models. 

Development of robot introspection is expected to have a direct impact on a large 
variety of practical applications, such as that can be used to estimate the log-likelihood 
of failure and prevent the failure during robot manipulation task. Also, the improve- 
ment of safety in human-robot collaborative environment by assessing the quality of 
learned internal model for each skill, which can speed up the anomaly recovery and/or 
repair process by providing the detailed skill identification and anomaly monitoring. 

There are a number of limitations in our work. Currently we do not explicitly 
reduce the influence of outliers occasionally found in the derived log-likelihood val- 
ues for specific hidden states. The outliers have a significant impact on the calculation 
of the threshold and our approach needs to address them specifically to avoid their 
impact. Additionally, we note the fact that our kitting experiment was not conducted 
under real factory conditions or in a real household daily task. Thus the verifiability 
of the work in real-world settings is unclear and further testing in real-factory condi- 
tions is necessary. The kitting experiment provides a proof-of-concept and the authors 
would like to extend their work to actual scenarios through corporate partners. 
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Signals synchronization and preprocessing of different modalities 
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Features Extraction from time, frequency domain, respectively 


oe ae 


Feature significance evaluation using hypothesis testing 


} 


Fig. 5.7 The implementation procedure of sparse representation 


5.6 Anomaly Classifier Based on Feature Extraction 


The procedure of the sparse representation system for multimodal time-series is 
shown in Fig. 5.7. 


5.6.1 Anomaly Sample Collection 


e Sensory Preprocessing 


The original multimodal sensory data includes a 6 DoF force and torque signals from 
F/T sensor, a 6 DoF Cartesian velocity signal from a manipulator’s end-effector, 
56 DoF tactical signals from a left and a right tactile sensor panels. To produce 
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more consistent data content, we do not directly concatenate individual data sources, 
instead temporal synchronization is done across modalities at 10H Z. Also, different 
preprocessing techniques are done for specific modalities. 

Wrench modality: Takes a force and torque time-series vector sequence and 
for each element represents the magnitude of each dimension (fx, fy, fz» tx, ty, tz). 
Empirically, we wish that anomalies HC and TC can effectively flag external pertur- 
bations caused from different directions. We also consider the norm of force n ¢ and 
torque n, respectively: 


ne = A te ae es ct m= gn re (5.3) 


Velocity modality: We measure the linear (/,,/),1,) and angular (ax, ay, az) Carte- 
sian velocity (the endpoint state of a Baxter right hand), which are reported with 
respect to the base frame of the robot. As with the wrench source, we also consider 
the norm of the linear velocity n; and angular velocity n, respectively. 


n = BREDH E, Ng = az +a? +a? (5.4) 


Taxel modality: Due to the high dimensionality of tactical sensor, we do not process 
all the original signals as the model input. After empirical testing, the standard 
deviation of each tactile sensors s;, s, were selected as the preferred features and 
defined as: 


28 


1 
s= |z Gi - md, s 


i=1 


(5.5) 


where the u; = x Da l; and u, = x ye r; is the mean of each tactical panel, 
respectively. 

The above three preprocessing operations capture unexpected changes in the orig- 
inal signals since the signals are slightly different from the robot variable progress 
in normal executions. We then concatenate all the features to represent robot execu- 
tions both in nominal and anomalous cases. Note that raw concatenated features can 
easily result in high False Positive Rates (FPR) during execution due to significant 
task execution variability across the same task in our experiments. For instance, the 
F/T signals vary different across differing objects of different weights; or similarly, 
human collisions which occur from different directions and magnitudes. 

A standardization method is used to scale original signals £, by its mean and 
standard deviation according to: 


_ So(*) — mean (Eo (*)) 


569) std(E) 


{fos Ay, +51, Sp}. (5.6) 


Finally, our eighteen dimensional multimodal signal(wrench, velocity, and tactical 
modalities) is represented as: 
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Yr = E), Py), Ee), E), Ey), Elt), 
Elp), E), Ell), El), E), 6 (ax), (5.7) 
é (dy), € (az), (m1), E (na), € (81), 6 (Sr). 


5.6.2 Anomaly Features Extraction 


In order to keep temporal consistency, anomaly signals are considered for a given 
window_size which is fixed around events flagged as anomalous. For instance, if 
a tool collision is detected as shown in Fig.5.8, then window_size can evaluate 
how the diagnoses reactivity performs in our system. Generally, the window_size 
will equal a power of two (the a preferred size when including the Fast Fourier 
Transformation (FFT) between the time and frequency domain). 

As described above, the sparse representation is applied for each extracted sample 
window. According to our previous work on anomaly diagnoses [20], the features are 
extracted in both the time domain and frequency domain. For anomaly diagnoses, 
we empirically consider the independent features and the corrective features along a 
specific modality signal as in Eq.5.7. Here, the original multimodal signals with 18 
DoFs and 12 feature types are considered in both the time and frequency domains, 
where the final feature vector is of length 558. 

A: Independent features 
Here, we calculate the features along a specific modality signal E {x} = (x1, X2, ..., Xn) 
with n data points 


Multimodal Signals of Tool Collision 


3 | ---- anomaly detected 
© win _len=ls 
1 win_len=2s 
2 | win_len=3s 


Time(s) 


Fig. 5.8 Our robot introspection system for extracting the anomaly data when anomaly detected. 
For instance, the data of tool collision is represented with a given window_size = +2s in red 
background 
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1. 


mean 
HS 68) 
= — Xi > 
: GAEE 
. standard_deviation 
A Vw we (5.9) 
aa 
. mean_diff: calculate the mean over the differences between subsequent values 


n—-1 


1 
haist = — 9 Gist = xi) (5.10) 


i=1 


. mean_abs_diff: calculate the mean over the absolute differences between subse- 


quent values 


n-1 
1 
Mabs_diff = 7 X leii mal (5.11) 
Ei 


. abs_energy: calculate the absolute energy, that is the sum over the squared values 


eas = X x? (5.12) 
i=1 


: Correlative features 


. autocorrelation: calculate the autocorrelation of the specified lags k € {1, 2, 3, 4} 


given the mean u and variance o”, respectively. 


1 n—k 
EEE de U) = U) (5.13) 


. mean_autocorrelation: calculate the mean of the autocorrelation which taken over 


all possible lags / € {1, ..., n} 


1 n 
uP) = —— D (5.14) 


. std_autocorrelation: calculate the standard deviation of the autocorrelation which 


taken over all possible lags / € {1, ..., n} 
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1 n 
oP) = | VA -MD (5.15) 
l=1 


. ar_coefficient: get the first 5 coefficients by fitting the unconditional maximum 
likelihood of an autoregressive model </#(p) with order p. In case of our imple- 
mentation, the order is a fixed p = 10. The “@&(5) model is defined as 


p=10 
x= po + >> Pixi + £r (5.16) 


i=1 


where €, is drawn from a Gaussian white noise with a mean of zero and unit 
variance. 

. partial_autocorrelation: calculate the value of partial autocorrelation function of 
given lag k € {1, 2, 3, 4, 5}, denoted æ (k), is the autocorrelation between x, and 
Xı+k With the linear dependence of x, on x,4; through x,4,—; removed. As such, 
the function is defined as 


Cov(Xr, Xt-k) = Cov (Xr, Xt—K| X11, «+s Xt-k+1) 
var (X;) = Var (X;|X1-1, «++ Xt—k+1) 
var (X-k) = Var (XK 1X11, «+s Xt-k+1) (5.17) 


Cov(x;, X1-K) 


a= / var (xi )var (X-k) 


: Spectrum-based features 


. fft_coefficient: Get the first 5 Fourier coefficients of real part by FFT algorithm, 
respectively. In this implementation, we define the discrete Fourier transform 


function as 
n—-1 mk 
F= Y dm exp{—2n —}, k=0,...,.n-1 
m=0 1 


(5.18) 
am = exp{2rifmAt} 
where, At is the sampling interval. 


. fft_angle: Calculate the angle of obtained complex value F of the first 5 Fourier 
coefficients, respectively. 


Fima 
0 = tan`! —t (5.19) 


real 
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5.6.3 Experiments and Results 


e Experimental Setup 


A kitting experiment consists of 5 basic nodes: Home, Pre-pick, Pick, Pre-place, and 
Place. The experiment is implemented in the following order with those nodes by 6 
skills with the ROS-SMACH,? (Skill 1): Home > Pre-pick; (Skill 2): Pre-pick > 
Pick; (Skill 3): Pick — Pre-pick; (Skill 4): Pre-pick — Pre-place; (Skill 5): Pre-place 
— Place; (Skill 6): Place —> Pre-place, as shown in Fig. 4.11. 

The primary goal of the kitting task is designed to transport 6 different objects 
to a fixed container. The right arm of Baxter humanoid robot is used to pick objects 
and equipped with a 6 DoF Robotiq FT sensor, 2 Baxter-standard electric pinching 
fingers. Each finger is further equipped with a multimodal tactile sensor that a four by 
seven taxel matrix that yield absolute pressure values. The left hand camera is placed 
flexibly in a region that can capture objects with a resolution of 1280 x 800 at 1 fps 
(we optimize pose accuracy and lower computational complexity in the system). The 
use of the left hand camera facilitated calibration and object tracking accuracy. All 
code was run in ROS Indigo and Linux Ubuntu 14.04 on a mobile workstation with 
an Intel Xeon processor, 16GB RAM, and 8 cores. 

When robot collaboratively works with human in a shared workspace, so many 
external disturbances are likely to occur. Those anomalies in Fig. 4.12 are considered 
in the Kitting experiment, which including the following 7 types: Tool Collision 
(TC) that may be derived from the visual error or the user accidentally collide with 
the object during robot moving to grasp it (see Fig.4.12b); Human Collision (HC) 
is usually happened by a user to unintentionally collide with the robot arm in the 
human-robot collaboration environment(see Fig. 4. 12a). We treat the human collision 
differently from whether the robot carrying object or not. Thus, Human Collision 
with Object (HCO) is assumed that the human collision while robot carrying object 
from the node Pre — pick to Pre — place (see Fig.4.12d). The object have been 
knock down by the robot during grasping may induce the No object (NO) or missed- 
grasps(see Fig. 4.12f). Another common anomaly is Object Slip (OS) that the picked 
object may slip from the robot’s gripper if the grasping pose is not optimal or the 
robot moves at high speed. Finally, the False Positive (FP) is labeled when some 
unexpected disturbances may be detected by the anomaly detector for a variety of 
reasons, for instance, the system error, the object is placed at unreachable zone, 
without feasible inverse kinematic solution, and so on. In the rest of this paper, we 
intentionally achieve the spare representation of the seven types of anomalies while 
preserved sufficient diagnoses accuracy, respectively. 


e Results and Analysis 


The dataset contains a total of 108 samples from 137 experimental recordings of 
kitting task and the proportion for each anomaly are TC 15.7%, HC 16.7%, HCO 
16.7%, NO 13.0%, OS 16.7%, WC 15.7%, FP 5.6%, respectively. For evaluating the 


7http://www.ros.org/SMACH. 
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Table 5.4 Multiclass classifiers 


Index Classifier Comment 

1 BernoulliNB Binarizing the dataset with the boolean threshold u of 
each madality axis, which indicates that feature values 
below or equal to the threshold are replaced by 0, 
otherwise, by 1. The smoothing parameter œ = 1.0 and 
with a uniform prior on the class distribution 

2 GaussianNB Default settings 

3 DecisionTree A non-parametric supervised learning method used for 
diagnoses. A major advantage is that it does not require 
huge data preparation 

4 RandomForest Default settings 

5 LinearSVC Default settings 

6 LogisticRegression | Default settings 

7 SGDClassifier Preprocessing procedure is added to convert the dataset 
with zero mean and unit variance 

8 RidgeClassifier Default setting 

9 KnnDtw n_neighbors=2 and max_warping_window=10 


performance of the proposed sparse representation of multivariate time-series, we 
took the following 9 representative classifiers*into consideration and the parameter 
settings are described respectively in Table 5.4. 

As described above, the feature selection is a process where the most significant 
features in predicting the outcome are selected automatically. However, irrelevant 
features decrease the model’s accuracy and increase computational cost. We there- 
fore calculate the p_value of each extracted feature by using the hypothesis testing 
method in spscitefresh2016. That is, we preform a singular statistical test checking 
the hypotheses for each extracted feature fi, fo, ..., fns 


Hj, = {x; is irrelevant for predicting class y}; (5.20) 
Hİ = {x; is relevant for predicting class y}; . 


The result of hypothesis test in Eq. 5.20 is a p_value, which assess the probability 
that feature x; is not relevant for predicting class y. As such, we define the score of 
feature by calculating the negative logarithmic value on the p_value. Large scores 
— log(p_values) indicate features, which are relevant for predicting the target. 

The performance of different classifiers is shown in Fig. 5.9, as you can see (read- 
ing left to right on the graph), the accuracy indicates to increase as the number of 
features are added, until a point beyond which there seems to be too few features 
for the classifier to make any reliable conclusions. Specifically, those features are 
extracted from a sorted feature vector in descending order by score values. 


3http://scikit-learn.org/. 
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5 The comparsion of accuracy and the number of features for different classifiers 


Classification accuracy 


0 100 200 300 400 500 600 
The number of features 


BernoulliNB =—s RandomForestClassifier e~e SGDClassifier 
GaussianNB e—e LinearSVC =s—s RidgeClassifie 


DecisionTreeClassifier =—a LogisticRegression e e KnnDtw 


Fig. 5.9 The comparison between diagnoses accuracy and the number of feature on different 
classifiers 


5.6.4 Discussion 


This work implements sparsely represent the recorded multimodal time-series with 
relevant features as small as possible while preserving diagnoses accuracy. We pro- 
pose the multivariate features are extracted in both time domain and frequency 
domain and not only consider the static statistical characteristics, but also including 
the correlation and interaction of each dimensional sensory signal. Results indicate 
that the data set can be significantly reduced up to 72.2% ~ 86.1% (the number 
of features is 100 and 200, respectively) of the raw data while keep the average 
diagnoses accuracy at about 85% with small data preparation. Future work should 
therefore include analyzing the trade-off between the value window_size and the 
diagnoses accuracy. So as to represent the recorded multimodal time-series with 
relevant features as small as possible while preserved diagnoses accuracy. 


5.7 Summary 


In this chapter, anomaly diagnose methods using the Bayesian nonparametric hid- 
den Markov models and sparse representation by statistical feature extraction when 
anomaly triggered are introduced. Specifically, the detail procedure for anomaly sam- 
ple definition, the supervised learning dataset collection as well as the data augmen- 
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tation of insufficient samples are also presented. The proposed methods is verified 
with a multi-step human-robot collaboration objects kitting task on Baxter robot, the 
performance and results are presented of each method respectively. That is, a multi- 
target classifier based on the nonparametric Bayesian model is proposed, Which 
using the sHDP-VAR-HMM model to model the anomaly sample data of various 
anomaly types. For the task of Baxter robot performing kitting experiment, resulting 
in the diagnoses accuracy of a total of 7 types of anomaly events in is 97.1%. Addi- 
tionally, for the sparse representation for anomaly diagnoses, we extract a total of 31 
features of time and frequency domains of the anomaly sample. Then, the extracted 
features importance analysis using the hypothesis testing method, after that the fil- 
tered features are verified on 9 common out-of-the-box supervised learning methods 
for multi-class diagnoses in sklearn. 
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Chapter 6 A) 
Learning Policy for Robot Anomaly geit 
Recovery Based on Robot Introspection 


Abstract In this chapter, the anomaly recovery would be acted when both of the 
anomaly monitoring and diagnoses are analysed, which aim to respond the external 
disturbances from the environmental changes or human intervention in the increas- 
ingly human-robot scenarios. To effectively evaluate the exploration, we summarize 
the anomalies in a robot system include only two catalogues: accidental anomalies 
and persistent anomalies. In particular, we first diagnose the anomaly as accidental 
one at the beginning such that the reverse execution is called. If and only if robot 
reverse many times (not less than twice) and still couldn’t avoid or eliminate the 
anomaly, the human interaction is called. Our proposed system would synchronously 
record the multimodal sensory signals during the process of human-assisted demon- 
stration. That is, anew movement primitive is learned once an exploring demonstra- 
tion acquired. Then, we heuristically generate a set of synthetic demonstrations for 
augmenting the learning by appending a multivariate Gaussian noise distribution with 
mean equal to zeros and covariance equal to ones. Such that the corresponding intro- 
spective capabilities are learned and updated when another human demonstration is 
acquired. Consequently, incrementally learning the introspective movement primi- 
tives with few human corrective demonstrations when an unseen anomaly occurs. 
It is essential that, although there are only two different exploring strategies when 
anomaly occurs, numerous exploring behaviors can be generated according to dif- 
ferent anomaly types and movement behaviors under various circumstances. 


6.1 Introduction 


In recent years, with the widespread promotion of the cooperative robot and human 
beings’ inclusive operation, the robot’s abnormal recovery behavior cannot be per- 
formed by the traditional robot’s own motion planning algorithm, and human expec- 
tations for robot motion should be brought into play. The recovery strategy will 
reflect the human-centered concept of human-machine collaboration. For this rea- 
son, the robot anomaly recovery problems considered in this article mainly refer to 
the recoveries learned from humans by themselves in the case of robots that can 
recovery anomalies (not considering abnormal situations that cannot be recovered 
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autonomously such as power interruptions or system crashes). The strategy makes 
the robot’s motion change from an abnormal state to a normal state. In other words, 
how to learn from humans to obtain corresponding recovery strategies for different 
abnormal types of robots is the key issue of this chapter, and it is required that the 
proposed recovery strategies have certain expansibility and generalization. 


6.2 Related Work 


To address the unpredictable abnormalities in robot operation tasks in an unstructured 
dynamic environment, Gutierrez et al. [1] proposed a way of expanding the original 
operation graph to recovery unforeseen abnormal events. The overall idea of the 
method is to use finite states machine learns an initial robot operation diagram for 
limited human demonstration movements, then re-plans the robot’s movement when 
there is an abnormality in the actual operation application, and adds this recovered 
movement to the operation diagram, as the operation diagram continues Perfect, so 
that we can always find a feasible recovery behavior for the abnormal event. Paster 
et al. [2, 3] proposed the establishment of a specific library of motion primitives for 
the robot’s operation tasks, and taught the different ways to complete the task through 
human experience. After combining the methods of motion primitive selection, the 
abnormal premise was detected Next, the next primitive corresponding from the 
motion primitive library is completed [4, 5], thereby completing the behavior of 
abnormal recovery. Different from the robot’s own method of motion conversion 
and selection, Salazar-Gomez et al. [6] proposed to implement robot’s abnormal 
recovery by introducing human observation and control [7]. This method is a kind 
of human-in-the-loop (Human-in-the-loop) human-robot interaction. This has the 
advantage that humans have a more intuitive understanding of the robot’s movements. 
When abnormal or error occurs, they can think about feasible solutions in time. 
Such recovery behavior is more Vulnerable to researchers and more conducive to 
applications in the human-machine environment. Niekum et al. [8] also described the 
operation tasks of the robot through the finite state machine. In the case of detecting 
an abnormal event, humans assist the robot to complete the current sub-task through 
kinematic teaching. 

With associating this recovery behavior with the type of anomaly and update it to 
the original operation diagram, so that the robot can adopt a specific recovery behavior 
in the face of different anomalies. Mao et al. [9] proposed a method of human-assisted 
learning to implement robot abnormal recovery, where the robot’s movement from 
the initial motion primitive will collide with the environment. At this time, the human 
manual Stop the robot and start the teach and teach mode to re-teach the recovered 
motion primitives for the robot to complete the abnormal recovery. Karlsson et al. [10, 
11] proposed an online method of dynamically modifying robot motion primitives to 
achieve anomaly recovery. This method also pre-parameterized the robot’s motion 
by DMP. In the event of an abnormality, it is indicated by human motion Teach 
ways to learn recovery behavior. In [12], the authors proposed the abort and retry 
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behavior in grasping by minimizing the overhead time as soon as the task is likely to 
fail. As an alternative, Johan S. Laursen et al. introduced a system for automatically 
reversed in robot assembly operations using reverse execution, from which forward 
execution can be resumed [13]. Similarly, a recovery policy by modelling reversible 
execution of robotic assembly is proposed in [14]. Another perspectives for anomaly 
recovery, Arne Muxfeldt et al. proposed a novel approach for recovering from errors 
during automated assembly in typical mating operations [15—17], which is based on 
automated error detection with respect to a predefined process model such that a 
recovery strategy can be selected from an optimized repository. 

In summary, the use of human-machine collaboration to demonstrate abnormal 
recovery behavior has been favored by a large number of researchers, because in daily 
life or human-machine collaboration environment, the abnormal recovery behavior 
of robots cannot be achieved by the traditional robot itself. The motion planning 
algorithm should be carried out, and human expectations for robot motion should 
be brought into play. The human-assisted robot abnormality recovery strategy will 
further reflect the “human-centered” human-machine collaboration concept. In view 
of this, based on anomaly monitoring and classification, this paper proposes two 
types of recovery strategies for accidental and persistent anomalies, and different 
strategies will have different recovery behaviors under different types of anomalies. 


6.3 Statement of Robot Anomaly Recovery 


Specifically, the task segmentation, anomaly monitoring and diagnosis can be imple- 
mented through the generative methods such as the Bayesian non-parametric time 
series model that described in Chaps. 3-5 respectively. With the help of the task 
generalization and introspective capabilities for each movement primitives, two task 
exploration policies are proposed for responding the anomalies: reverse execution or 
human interaction. These policies can be learned from the context of manipulation 
tasks in an increasing manner. An overview of our SPAI framework can be described 
by a graph ¥Y composed of N, behaviors (or sub-tasks) 4, which are intercon- 
nected by edges & such that Y¥:{B,&}, B={BH,.., Bn}, Gr ={5, 0 15,6 € 
B}. Behaviors in turn consist of nodes -V and edges &, such that: Z = {.V, E}. 
Nodes can be understood as phases of a manipulation task. In our work, we prefer 
to name them milestones -4% = (1, ..., Nr), as they indicate particularly important 
achievements in a task. With task exploration behaviors, we may introduce a explor- 
ing nodes ⁄; in-between milestones creating a new branch to the subsequent mile- 
stone. It is also possible to introduce further exploring nodes .%;, on already existing 
branches. The full set of nodes in a task then is described as the union of milestone 
nodes with all branched nodes VY = {4% U-%U-...U-%j..q}. Node Transitions 
J , behave as with behavior transitions &, so .%, = {(s, t) : s,t € M}. In our frame- 
work, a node is composed of modules. Modules are dependent processes that play a 
key role in the execution of a manipulation phase and could include demonstration 
collection, segmentation, movement learning, introspection, task representation and 
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exploration, vision, natural language processing, navigation, as well as higher level 
agents. In this chapter, we restrict a node to segmentation, generalization, introspec- 
tion and exploration modules. 


6.4 Learning and Application of Anomaly Recovery Policy 


6.4.1 Learning for Unstructured Demonstrations 


As we know, a sentence in language is made up of words according to grammatical 
rules, and a word is made up of letters according to word formation. Correspond- 
ingly, a complex and multi-step robot manipulation task can be represented as a 
sentence, in which a coupled robot movement primitive is equivalent to a word, and 
the robot movement primitive of each degree of freedom (DoF) of can be consid- 
ered as letter such that the robot manipulation task can be represented with a set 
of movement primitives. To this end, how to effectively learn a multi-functional 
movement primitive is a critical problem for intelligent robot performing complex 
tasks in unstructured environment. If so, that is an interesting idea for generalizing 
the task representation with respect to the external adjustment as well as improv- 
ing the diversity and adaptability of tasks. As a consequence, the task is consist of 
sequential movement primitives, which not only take the kinesthetic variables into 
consideration but also equip with introspective capacities such as the identification of 
movement, anomaly detection and anomaly diagnoses. We now introduce a Bayesian 
nonparametric model for robust learning the underlying dynamics from unstructured 
demonstrations. 


e Demonstrations 


Generally, capturing the demonstrations by receiving the multimodal input from the 
end-user, such as a kinesthetic demonstration. The only restriction on the multimodal 
data is that all the signals must be able to be synchronously recorded at the same 
frequency, i.e. temporal alignment. Additionally, the multimodal data at each time 
step should include the Cartesian pose and velocity of the robot end-effector (in case 
of object-grasping, will along with any other relevant data about the end-effector, 
e.g. the open or closed status of the gripper and the relative distance between the 
end-effector and object.) as well as the signals from F/T sensor and tactile sensor. 
Subsequently, the recorded Cartesian pose and velocity trajectory of the end-effector 
will be referred to as the kinematic demonstration trajectory for controlling the robot 
motion, and the recorded signals of F/T sensor and tactile sensor are applied for 
learning the introspective capacities. 


e Learning from Demonstration 


Learning from demonstration (LfD) has been extensively used to program the robots, 
which aiming to provide a natural way to transfer human skills to robot. LfD is 
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proposed by simply teaching a robot how to perform a task as human-like, in which 
users can demonstrate new tasks as needed without any prior knowledge about the 
robot. However, LfD often yields weak interpretation about the environment as well 
as the task always is single step and lab-level such that lacks of robust generalization 
capabilities in dynamic scenarios, especially for those complex, multi-step tasks, e.g. 
human-robot collaborative kitting task designed in this paper. 

For this reason, we present the powerful algorithms that draw from recent advances 
in Bayesian nonparametric HMMs for automatically segment and leverage repeated 
dynamics at multiple levels of abstraction in unstructured demonstrations. The dis- 
covery of repeated dynamics provides discriminating insights for understanding the 
task invariants, high-level description from scratch, and appropriate features for the 
task. In this paper, these discoveries could be concatenated using a finite state rep- 
resentation of the task, and consisted of movement primitives that are flexible and 
reusable. Thus, this implementation provides robust generalization and transfers in 
complex, multi-step robotic tasks. We now introduce a flowchart which integrates 
three major modules that critical for implementing the complex task representation 
from unstructured demonstrations. 


e Segmentation 


The aim of the segmentation is exploring the hidden state representation of the demon- 
strations, in which a specific hidden state usually denotes the clustering observations 
would be sampled from a statistical model, e.g. multivariate Gaussian distribution. 
Hidden state space modeling of multivariate time-series is one of the most impor- 
tant tasks in representation learning by dimensional reduction. In this work, we 
propose the segmentation approach is a hidden state determined with Bayesian non- 
parametric hidden Markov model, which leads to tackle the generalization problem 
in a more natural way to meet the need of real-world applications. 

As we discussed above, the HDP-VAR-HMM interpreted each observation y, by 
assigning to a single hidden state z,, where the hidden state value is derived from 
a countably infinite set of consecutive integers z; = k € {1, 2, 3, ..., K}. We denote 
O, to represent all the parameters of the trained HDP-VAR-HMM from nominal 
demonstrations, including the hyper-parameters for learning the posterior and the 
parameters for the observation model definition. 


z = arg max p(Z;|),, Os). (6.1) 
1p., K 


Here, the z would be a variable value, that is, we can concatenate the derived hidden 
state sequences and group the starting and goal observations for each sequence, 
respectively. 

Assume that we record N multimodal demonstrations via kinesthetic fashion and 
jointly model them using the HDP-VAR-HMM, result in N hidden state sequences 
F={Z, B,..., Zy, }, where the element of &, is integer. Here, the problem is to 
segment the demonstrations into time intervals associated with individual movement 
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primitives by state-specific way. After the complex task segmentation, we achieve 
the task representation as presented in Chap.3 using Finite State Machine (FSM) 
and Dynamical Movement Primitives (DMP). 


6.5 Reverse Execution Policy for Accidental Anomalies 


Reverse execution allows the robot retry the current movement or several movements 
that independent with the current state for resolving the accidental faults, such as 
human collision and mis-grasp. The key question is how far back must we revert 
in the task? To address this, we are currently evaluating different methodologies to 
learn reverse policies. Ideally, the performing critic is able to include all task-relevant 
information: the state of the robot, the state of the environment, the affordances of 
a task, and the relationship between these elements. However, this is not trivial, we 
are studying whether we could integrate decision making processes from multiple 
users. It’s also difficult to measure the motivation of users to select a given node. 
Human users might have key awareness of the task that may render them select nodes 
for different reasons. Expected Utility is an area of study in Risk management [18]. 
An utility probabilistic model is designed to reflect a users intrinsic motivations, 
not limited to utility, risk propensity, and the influence in learning within a single 
decision episode, or across episodes. We expect to present preliminary results at the 
workshop. 

An intuitive illustration of reverse execution is presented in Fig.6.1. We assume 
the robot performs the current behavior B; = {-%', My and an accidental fault F, 
is detected, where .%' and A, indicate the starting and goal node, respectively. Sub- 
sequently, a new exploring node named , for responding this fault is autonomously 
appended to original task graph ¥ as illustrated in Sect. 6.3, that is 


re TB, BAF, Bi (6.2) 


Fig. 6.1 Exploring the task via reverse execution when a fault event occurs 
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As formulated in Eq. (6.4), the symbol Z, = {.%*, M} denotes a optimal way that 
consists of one or more selected movement primitives for retrying under current 
fault type and behavior situation. So, the critical problem is how to parameterize 
the transition probability p( 72, 2,) given F, and Z; when more than one ways to 
explore. To address this, a statistical distribution is introduced, which the instances 
are independent identically distributed and belong to a discrete distribution as well 
as the sum of the transient probabilities of all the samples after a fault occurrence is 
equal to 1. For these reasons, we define the p(.%z,_g,) is modelled with a multinomial 
distribution that the random variables is the respective frequency counted based on 
human intention when a fault occurs. For example, Nz, = (N1, N2, ..., Ng) is a 
frequency distribution of a random variables vector, where K represents the total 
number of movement primitives that from beginning to current movement 4; with 
fault F, (including the 4;) and N;,i € {1, 2, ..., K} denotes how many times of 
movement %; is successfully executed. Therefore, the probability mass function of 
transition p(.7z,,g@,) is formulated as a multinomial distribution that model a total 
N= SS N; reverse executions, that is, 


N Én 
p(N|N, 8) = a [14 i (6.3) 


i=l 


where, 0; indicates the probability of movement primitive %; is selected, which sub- 
ject to 0; € [0, 1] and a 6; = 1. Therefore, we use the multinomial distribution 
not only to intuitively depict the expectation of human intention on the recovery 
behavior when an abnormal occurrence is detected in human robot interaction sce- 
narios, but also to provide an indirect way to express the intuitive understanding of 
human for expecting the motion of the robot end-effector, the related manipulation 
objects as well as the complex relationship between “human-robot-environment”’. 


6.6 Human Interaction Policy for Persistent Anomalies 


Human interaction allows the robot explore the current movement by human-assist 
demonstration for resolving the persistent faults that can’t be restored by reverse exe- 
cution, such as tool collision and wall collision. The key question is how to fast capture 
and learn from the interactive demonstration base on the defined task representation. 
This exploring policy is activated when the system fail more than two times when 
reverse executions is carried out for resolving the fault Y,. The human interaction 
policy is an another exploring way that through human assisted demonstration when 
robot encounter a persistent fault. It’s essential that synchronously capture the human 
kinesthetic demonstration that not only the kinematic variables but also the relative 
coordinate frame (updating the task structure defined in Sect.6.3 by understating 
the transformation relationship between demonstration and original movement) at 
each time step. After that, the movement is formulated by the DMP techniques intro- 
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Fig. 6.2 Exploring the task via human interaction when a fault event occurs 


duced in Chap. 3. In particular, the human interaction policy is not only limited to 
the originally designed movements, but also can be applied to the exploring move- 
ment (reverse execution or human interaction). Theoretically, our system can handle 
any kinds of failure situation with those aforementioned two exploring policies such 
that potentially achieve the long-term autonomy during robot manipulation task. 
Additionally, experiential verification indicates that another problem arises, there 
are faults also exist in reproducing the human interaction. To address this problem, 
we introduce a data augmentation method (without detail statement in this paper) for 
training a new exploring movement and introspective model (described in Chap. 4) 
when the same persistent fault is continuously encountered more than three times 
in the same (multimodal signals are recorded at each time). Consequently, we can 
achieve the critical behavior for incrementally addressing the faults by “exploration 
of exploration” and “anomaly monitoring and diagnose in exploration”. 

Without loss of generality, an intuitive illustration of human interaction is pre- 
sented in Fig.6.2. We assume the robot performs the current behavior #4; = 
LM, Ng } and an accidental fault F, is detected, where AN, and Mg indicate 
the starting and goal node, respectively. Subsequently, a new exploring node named 
a for responding this fault is autonomously appended to original task graph as 
illustrated in Sect. 6.3, that is 


A: IB; Bril Fy, Bj (6.4) 


where, 2, indicate the human interactive demonstration 4, = DA ; NP), that 


have the same starting node N,! with movement Z j and a new goal node N is 
derived from demonstration. Equipping this exploring movement 2, by formulat- 
ing it using the dynamical movement primitive. Additionally, we define the goal 
pose (end-effector or joint variables) of demonstration as P that only task kine- 
matic variables into consideration for task exploration. The P can be adapted by the 
transformation relationship between .%/ and he , then Nl derived by 


HPD a (6.5) 
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(a) Deficient movement 


(c) Resulting movement with self-adaptation 


Fig. 6.3 Illustrates the human interactive demonstration in our kitting experiment when the robot 
collides with the packaging box during transporting an object 


As Fig. 6.3 shown a human robot collaborative task is designed in latter experi- 
mental verification, where a Baxter robot encounter a persistent fault (wall collision) 
during transporting object from human-over. In this situation, the robot likely to 
encounter a fault along the deficient movement (as shown in Fig. 6.3a, and the fault 
can’t be eliminated by reverse execution because of the fixed obstacle (box) on the 
right hand side of robot. An modulated trajectory should be introduced for updating 
the original task representation, as shown in Fig.6.3b a human interactive demon- 
stration for exploring the failure task. Subsequently, a transformation in Eq. (6.5) is 
derived by learning from interactive demonstration, which can be explored in a new 
scenario when encounter a same fault, as shown in Fig. 6.3c. 


6.7 Experiments and Results 


6.7.1 Platform Setup 


To evaluate the proposed method for incremental learning the introspective move- 
ment primitives. We designed a HRC task for picking and placing object into a 
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Fig. 6.4 An autonomous kitting experiment by Baxter robot: The robot is designed to transport 6 
marked objects with variable weights and shapes to a container, where the external anomalies may 
arise from accidental collisions (human-robot, robot-world, robot/object-world), mis-grasps, object 
slips So as to identify those unexpected anomalies, the robot arm is integrated with multimodal 
sensors (shown in bottom-right), including internal joint encoders, F/T sensor, tactile sensor 


container using Baxter robot and integrated the ROS Indigo! and Moveit as the 
middle-wares in our system. Specifically, a human co-worker is tasked to place a 
set of 6 objects marked with Alvar tags? on the robot’s reachable region (located in 
front of the robot) in a one-at-a-time fashion. The objects may accumulate in a queue 
in front of the robot once the first object is placed on the table, the robot’s left arm 
camera identifies the object and the robot’s right arm picks and places it in a container 
located to the right of the robot, as shown in Fig. 6.4. Multiple sensors were installed 
for effectively sensing the unstructured environment and potential faults in such a 
kitting experiment. Here, the right arm of Baxter robot is equipped with a 6 degrees of 
freedom (DoF) Robotigq F/T sensor and 2 Baxter-standard electric pinching fingers, 
where each finger is further equipped with a multimodal tactile sensor composed of a 
4 x 7 taxel matrix that yields absolute pressure values. In addition, Baxter’s left hand 
camera is placed flexibly in a region that can capture objects in the collection bin 
with a resolution of 1280 x 800 at 1 fps (we are optimizing pose accuracy and lower 
computational complexity in the system). The use of the left hand camera facilitated 
calibration and object tracking accuracy. After there aforementioned integration, the 
robot picks each object and transports it towards the container, after which, the robot 
appropriately places each of the six objects in different parts of the container, several 
snapshots are visualized in Fig. 6.5. 

In consideration of the redundant features would aggravate computational effi- 
ciency and increase false-positive rate (fault occurs even when robot’s movement is 


‘http://wiki.ros.org/indigo. 
7https://moveit.ros.org/. 
3http://wiki.ros.org/ar_track_alvar. 
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Fig. 6.5 The Snapshots of human-robot kitting experiment with a Baxter robot. We subjectively 
assume this task consists of 5 individual movements, including a Home — Pre-pick; b Pre-pick > 
Pick; c Pick — Pre-pick; d Pre-pick — Pre-place; e Pre-place — Place; f Reset 
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Fig. 6.6 Multimodal signals in kitting task, where the grey represents the invalid robot movements 
including the adjustment, orientation alignment, reset, etc, other colors are empirically indicated an 
potential movement primitive along with obvious signals fluctuations 


normal), we perform empirical features extraction on the original observation vector 
to improve identification performance. Specifically, we compute the norm of both the 
force n ¢ and the moment nn as features in wrench modality, take the norm of both the 
Cartesian linear n; and angular n, velocities in velocity modality, and consider the 
standard deviation for each tactile sensor s; and s,. Therefore, our feature vector y; of 
length 6 and formulated as y; = [^ f, Nm, 1, Na, S1, Sr], evolving extracted features 
of ten nominal executions are illustrated in Fig. 6.6. 
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6.7.2 Parameter Settings for Anomaly Monitoring and 
Diagnose 


We need both qualitative and quantitative analysis of the proposed method. To evalu- 
ate the whole performance of IMPs in unstructured scenarios with unexpected faults 
that including both of the accidental and persistent causes. We organized 5 partici- 
pants as collaborator (one expert user who confidently know this implementation and 
other four novice users) in our designed kitting experiment. Novice users first learned 
from the expert to induce fault during robot executions, which would aggravate the 
external uncertainty and increase the modeling difficulties. During data collection, 
each participant performed 1 nominal and 6 executions that at least one fault event 
by placing the set of 6 household objects in a one-by-one fashion. Consequently, 
totally perform 30 nominal executions and 180 failure executions. We induce the 
fault manually for each movement primitive, where including but no limited to the 
Human Collision (HC), Object Slip (OS), Tool Collision (TC), Wall Collision (WC), 
etc. 

We first evaluate the complex task segmentation from twenty whole kinesthetic 
demonstrations using Bayesian nonparametric methods. As shown in Fig. 6.7, illus- 
trating the complex task segmentation by learning the underlying dynamics using 
HDP-VAR-HMM, where each row indicates an independent demonstration and dif- 
ferent color represents a specific movement primitive, which no need to guarantee 
each demonstration have the same segmentation order. After then, we concentrate the 
ordered hidden state sequence and group them by hidden state transition pair t Pair, 
where the total number of pair is computed using the permutation combination algo- 
rithm, i.e. t Pairs = C2 A = 20, where K is the dimension of hidden space. Thus, 
the frequency among the possible transition of the learned five movement primitives 
is illustrated in Fig. 6.8, where the kitting experiment is definitely begin with move- 
ment 2 (clustered by the hidden state 2) and then the successor should be movement 
1, subsequent movement is 4 for most cases or movement 3 is the second-choice, 
and so on. Until now, we can effective learning the complex task representation for a 
set of nominal unstructured kinesthetic demonstration in a manipulation graph way. 
With this representation, we can evaluate the following the anomaly monitoring, 
diagnose as well as task exploration, respectively. 

According to Chaps.4 and 5, our proposed movement primitive equipped with 
two introspective capabilities: anomaly monitoring and diagnose, that necessary for 
endowing robot long-term autonomy and safer collaborative interaction in human- 
robot collaborative scenarios. Particularly, our anomaly monitoring and diagnose are 
implemented based on the Bayesian nonparametric models proposed in Chap. 2 that 
using the HDP-VAR-HMM with a first-order autoregressive Gaussian likelihood, 
each state k has two parameters to be defined: the regression coefficient matrix A, 
and the precision matrix A; as well as the four parameters v, A, V, M of the conjugate 
prior MNIW are assigned in advance. To guarantee the prior has a valid mean, the 
degrees-of-freedom variable is set as v = d + 2 and A is set by constraining the 
mean or expectation of the covariance matrix E[ A, '] under the Wishart prior in 
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Fig. 6.7 Complex task segmentation by learning the underlying dynamics using HDP-VAR-HMM. 
Each row indicates an independent demonstration and different color represents a specific movement 
primitive, which no need to guarantee each demonstration have the same segmentation order 


Fig. 6.8 Illustrates the transition frequency among the derived five movement primitives. The hid- 
den state sequences is first taken to extract the movement primitives from a set of demonstrations. 
A uncompleted manipulation sequence is ordered with 2 —> 1 — 0 — 4 — 3 based on the derived 
sequences. The numbers on the transitions correspond to the transition frequency for weakly rep- 
resenting the task at the beginning, and subsequently strengthening by incremental learning the 
introspective capabilities and exploration from practical applications 
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Eq. (4.30). 


N T 
Ar = sY Yoo (6.6) 


n=1 t=1 


Assume that we record N sequential data for each skill and the length of sequence 
n € N is T,. Thus, we can easily define the parameter A accordingly as 


A= (v—d-—E[A;’]. (6.7) 


We placed a nominal prior on the mean parameter with mean equal to the empirical 
mean and expected covariance equal to a scalar s times the empirical covariance, 
here sr = 1.0. This setting is motivated by the fact that the covariance is computed 
from polling all of the data and it tends to overestimate mode-specific co-variances. 
A value slightly less than or equal to 1 of the constant in the scale matrix mitigates 
the overestimation. Also, setting the prior from the data can move the distribution 
mass to reasonable parameter space values. The mean matrix M and V are set such 
that the mass of the Matrix-Normal distribution is centered around stable dynamic 
matrices while allowing variability in the matrix values (see Chap. 2 for details). 


M = 04, 
(6.8) 

V = 1.0 * lų. 
where 0, and Iy are the matrix of all zeros and the identity matrix, respectively, 
each of size d x d. For the concentration parameters, a Gammac(a, b) prior is set 
on HDP concentration parameters œ. A Beta(c, d) prior is set on the self-transition 
parameter u and the degree of self-transition bias «x is set to 50. We choose a weekly 
informative setting and choose: 


a = 0.5, b = 5.0, c = 1.0,d = 5.0 (6.9) 


where, the initial transition proportion parameter is defined as u ~ Beta(1, 10) and 
the Split-Merge Monte Carlo method sampling parameter maximum iterations is set 
to 1000. The truncation states number is set to K = 5 for anomaly monitoring, and 
K = 10 for fault diagnose. 

The anomaly monitoring is implemented by comparing the cumulative likeli- 
hood Y of observed observations, where the fault threshold is calculated from a set 
of Y,,i € {1, 2, ..., 20} nominal demonstrations, and formulated with the expected 
cumulative likelihood 4(#) minus the standard deviation o (£) that multiply by a 
constant value c. Additionally, the fault diagnose is activated when fault detected, 
which mainly implemented by comparing the sum of log-likelihood of a failure 
sample in a supervised fashion. Particularly, we use the K-Folders Cross Validation 
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Fig. 6.9 Visualizing the robot trajectories of reverse execution when accidental faults are detected. 
As shown, the five normal IMPs is derived by learning from unstructured demonstrations that 
successfully performing human-robot collaborative kitting task and the dark-blue trajectories are 
explored with the maximum probability in our defined reverse policy once the fault detected 


defined in sklearn * (here, K = 3) for model selection, in which the objective is the 
accuracy of anomaly monitoring with a fixed constant value c = 3. 

To achieve the incremental learning introspective movement primitive from 
unstructured demonstrations, another critical capability would be the task explo- 
ration under the unseen and unexpected situation, especially when robot encounter 
a failure event. There are two independent exploring policies: reverse execution and 
human interaction, are proposed for responding the external accidental and persistent 
faults, respectively. We test the whole performance in two scenarios for evaluating 
how to combine the anomaly monitoring, fault diagnose, and exploration in practi- 
cal applications along with the quantitative performance. As illustrated in Fig. 6.9, 
a set of collective 3D moving trajectories is presented for evaluating the exploring 
reverse execution when robot encounter accidental faults. Where the kitting task is 
first represented by five introspective movement primitives, and then the acciden- 
tal faults (marked in yellow dots) randomly happened during robot execution, the 
robot immediately perform the reverse execution (dark-blue in color) according the 
frequency distribution shown in Table 6.1. 

Additionally, As illustrated in Fig. 6.10, a set of collective 3D moving trajectories 
is presented for evaluating the exploring human interaction when robot encounter 
an wall collision that is a persistent fault. To evaluate the overall performance after 
human one-shot interaction, we get a 80.95% success rate when the wall collision 
induced by repeating the experiment 30 times. 


4https://scikit-learn.org. 
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Table 6.1 The frequency distribution of reverse execution policy in kitting experiment, where we 
record the data from five participants in total of 30 times of each accidental fault. Additionally, the 
2 is an union of robot performing Pre — pick — Pick and Pick — Pre — pick 


Fault diagnose results | Fault occurs Goal IMP Frequency 
Human collision 1 1 25 
2 2 30 
3 3 25 
4 4 25 
Tool collision 2 1 2 
3 3 20 
Object slip 2 2 5 
3 1 25 


IMP-1 

IMP-2 

IMP-3 

IMP-4 

IMP-5 
Exploration 
Fault detected 


Fig. 6.10 Visualizing the robot trajectories after human interaction when persistent faults are 
detected. As shown, the five normal IMPs is derived by learning from unstructured demonstrations 
that successfully performing human-robot collaborative kitting task and the dark-blue trajectories 
are incrementally explored by updated the demonstrated IMP parameters (including the start and 
goal pose, shape) once the fault detected 


6.7.3 Discussion 


In our previous work, we focus on the robot introspective capabilities in robotics, 
which only take the kinematic variables into consideration such that the task is 
restricted to be applied in human-robot collaborative scenarios that generalization 
(including motion, introspection, and decision-making, etc.) as a desirable charac- 
teristic. To address this problem, robot movement primitive augmented with intro- 
spective capacities IMPs is investigated in this paper, which by associating the gen- 
eralization, anomaly monitoring, fault diagnoses, and task exploration during robot 
manipulation task. We mainly introduce the IMPs can be acquired by assessing the 
quality of multimodal sensory data of unstructured demonstrations using a nonpara- 
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metric Bayesian model, named HDP-VAR-HMM, and IMPs can incremental learning 
the exploring policy using multimodal distribution and human one-shot modification 
when robot encounter fault. Particularly, reverse execution and human interaction are 
two independent policies for task exploration, which proposed to respond the external 
accidental and persistent fault, respectively. Experimental evaluation on a human- 
robot collaborative packaging task with a Rethink Baxter robot, results indicate that 
our proposed method can effectively increase robustness towards perturbations and 
adaptive exploration during human-robot collaborative manipulation task. We need 
to emphasize that our method presents a solution for endowing robot with introspec- 
tion in sense-plan-act control methodology for robot manipulation task. 

Recently, we are working on extending IMPs to be more human-like manipula- 
tion that including the visual and audial information for human robot collaborative 
electronic assembly task. We are also investigating the use of variational recurrent 
autoencoder neural network to facilitate our proposed framework for more complex 
scenarios. 


6.8 Summary 


In this chapter, we present two policies to deal with accidental and persistent anoma- 
lies respectively: movement reverse execution and human interaction. Reverse exe- 
cution allows the robot retry the current movement or several move-ments that inde- 
pendent with the current state for resolving the accidental faults by redo current or 
some previous movement primitives by updating the parameters f primitives (includ- 
ing shape, starting point and target point, etc.), and to use polynomial distribution to 
realize the modeling of a primitive after the occurrence of an anomaly occurrence. 
Human interaction allows the robot explore the current movement by human-assist 
demonstration for resolving the persistent faults that can’t be restored by reverse exe- 
cution, such as tool collision and wall collision, which mainly to learn the anomaly 
behavior from the human demonstration, and can realize the growth of the task repre- 
sentation of the recovery behavior. Importantly however we learned that the recovery 
ability of the system grows in difficulty with an increased number of adaptations as 
variations in sensory-motor signals increase as more recoveries are attempted. 

The proposed recovery policies are verified on a Baxter robot performs kitting 
experiment tasks, results indicate that the proposed policies meet the extensibility 
and adaptability for improving the long-term autonomy, an integrated to the SPAIR 
framework. Consequently, this book provides a efficient theoretical framework and 
software system for the implementation of the longer-term autonomy and a safer 
environment for human-robot interaction scenarios. Ultimately the system presented 
in this book significantly extended the autonomy and resilience of the robot and 
has broad applicability to all manipulation domains that suffer from uncertainties in 
unstructured environments: making industrial and service robots prime candidates 
for this technology. 
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In this book, anomalies are and will continue to be a reality in robotics despite 


increasingly powerful motion-generation algorithms, to address them explicitly, we 
presented a tightly-integrated, graph-based online motion-generation, introspection, 
and incremental recovery system for manipulation tasks in loosely structured co-bot 
scenarios, which consist of the movement identification, task representation, anomaly 
monitoring, anomaly diagnoses, and anomaly recovery. 
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